MODULE 5
                    PARALLEL MODELS, LANGUAGES, AND COMPILERS
 Parallel programming models
 Programming Model: A collection of program abstractions providing a simplified and transparent view of
 the computer hardware/software system to the programmer.
1. Shared-Variable Model
 Resources in Programming Systems
         o ACTIVE RESOURCES – processors
         o PASSIVE RESOURCES – memory and I/O devices
 Processes
         o Processes are the basic computational units in a parallel program
         o A program is a collection of processes
         o Parallelism depends on implementation of IPC (inter-process communication)
 Fundamental issues in parallel programming
         o Specification, Creation, Suspension, Reactivation, Migration, Termination and
         Synchronization of concurrent processes
         o Process address space may be shared or restricted by limiting scope and access rights
 Multiprocessor programming is based on the use of shared variables in a common memory for IPC.
 Main Issues:
             Protected access of critical sections
             Memory Consistency
             Atomicity of Memory Operations
             Fast synchronization
             Shared data structures
             Fast data movement techniques
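As a minimal sketch of protected access to a critical section over a shared variable (POSIX threads are assumed here purely for illustration; the text itself does not prescribe a particular API):

    #include <pthread.h>

    /* shared variable and the lock that protects it */
    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&counter_lock);     /* enter critical section          */
            counter++;                             /* protected update of shared data */
            pthread_mutex_unlock(&counter_lock);   /* leave critical section          */
        }
        return NULL;
    }

Several such worker threads can update the shared counter without losing increments, which is exactly the protected-access issue listed above.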
      Scheduling of divided program modules on parallel processors is much more complicated than
       scheduling of sequential programs on a uniprocessor.
 Static Scheduling: conducted at post-compile time, i.e. before the program runs. Its
    advantage is low overhead, but its shortcoming is potentially poor resource utilization.
 Dynamic Scheduling: catches the run-time conditions of the machine.
         However, dynamic scheduling requires fast context switching, preemption, and much
            more OS support.
         It offers better resource utilization, at the expense of higher scheduling overhead.
Cache Coherence and Protection
3. Data-Parallel Model
     SIMD Lockstep operations
     Parallelism explicitly handled by hardware synchronization and flow control
     Synchronization done at compile time rather than at run time
     Data parallel programs require the use of pre-distributed data sets
     Choice of parallel data structure is an important consideration
 Emphasis on local computation and data routing operations
 Can be applicable to fine-grain problems
     Implemented either on SIMD or SPMD
     Leads to high degree of parallelism involving thousands of data operations concurrently
   Data Parallelism
     SIMD Characterization for data parallelism
            Scalar operations and scalar data operands
    Chang and Smith (1990) classified language features for parallel programming into 6
     classes according to functionality:
    Optimization features
    Availability features
    Synchronization/Communication features
    Purpose:
          Enhance user-friendliness
          Make the language portable to a large class of parallel computers
          Expand applicability of software libraries
 Scalability: in terms of the number of processors available
 Independence: from the hardware topology
    Compatibility: With an established sequential language
    Portability: across shared-memory and message-passing computers
Data parallelism features
      Purpose: How data are accessed and distributed in SIMD or MIMD computers
      Runtime automatic Decomposition: Express C
      Mapping Specification: DINO
      Virtual Processor Support :PISCES 2 and DINO
      Direct Access to Shared Data: Linda
    Where each ei is an arithmetic expression that must produce a scalar integer value.
    The first expression e1 is a lower bound, the second e2 an upper bound, and the third e3 an
     increment (stride).
 For example, B(1 : 4 : 3, 6 : 8 : 2, 3) represents the four elements B(1, 6, 3), B(4, 6, 3),
   B(1, 8, 3), and B(4, 8, 3) of a three-dimensional array.
    When the third expression in a triplet is missing, a unit stride is assumed.
 The * notation in the second expression indicates all elements in that dimension starting
   from e1, or the entire dimension if e1 is also omitted.
    When both e2 and e3 are omitted, the e1 alone represents a single element in that
     dimension. For example, A (5) represents the fifth element in the array A(3 : 7 : 2).
    Array assignments are permitted under the following constraints:
    The array expression on the right must have the same shape and the same number of
     elements as the array on the left.
 For example, the assignment A(2 : 4, 5 : 8) = A(3 : 5, 1 : 4) is valid, since both array
   sections contain 3 x 4 elements.
    When a scalar is assigned to an array, the value of the scalar is assigned to every element
     of the array.
    For instance, the statement B(3 : 4, 5) = 0 sets B(3, 5) and B(4, 5) to 0.
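Read as implicit loops, the two examples above can be sketched in C as follows (array names and bounds are chosen only for illustration, with 1-based indices kept to match the Fortran examples):

    /* Illustrative only: Fortran array sections rewritten as explicit C loops. */
    double B[5][9][4];     /* stands in for a 3-D array covering B(1:4, 1:8, 1:3) */
    double B2[5][9];       /* stands in for the 2-D array used in B(3:4, 5) = 0   */
    double sum = 0.0;

    void section_examples(void) {
        /* B(1:4:3, 6:8:2, 3) -> B(1,6,3), B(4,6,3), B(1,8,3), B(4,8,3) */
        for (int i = 1; i <= 4; i += 3)        /* e1 = 1, e2 = 4, e3 = 3 */
            for (int j = 6; j <= 8; j += 2)    /* e1 = 6, e2 = 8, e3 = 2 */
                sum += B[i][j][3];             /* third subscript: a single element */

        /* B(3:4, 5) = 0 : the scalar is broadcast to every selected element */
        for (int i = 3; i <= 4; i++)
            B2[i][5] = 0.0;
    }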
(Doall, Endall )
    To declare parallel activities, we use the (Doall, Endall) pair.
    All iterations in the Doall loop are totally independent of each other.
    This implies that they can be executed in parallel if there are sufficient processors to
        handle different iterations.
 However, the computations within each iteration are still executed serially in program
        order.
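The (Doall, Endall) pair is a language-level construct rather than C; as a rough modern analogue (an assumption for illustration only), fully independent iterations can be expressed with an OpenMP parallel loop:

    #define N 1000
    double A[N], B[N], C[N];

    void doall_example(void) {
        /* every iteration is independent, so iterations may run on different processors */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[i];    /* within one iteration, execution is still serial */
    }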
(Doacross, Endacross)
    When the successive iterations of a loop depend on each other, we use the (Doacross,
        Endacross) pair to declare parallelism with loop-carried dependences.
 Synchronizations must be performed between the iterations that depend on each other.
    For example, dependence along the J-dimension exists in the following program.
    We use Doacross to declare parallelism along the I-dimension, but synchronization
        between iterations is required.
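Again as an analogue rather than the text's own notation, a loop-carried dependence of the kind handled by (Doacross, Endacross) can be sketched with OpenMP's cross-iteration ordering clauses (a simplified one-dimensional case; the directives assume OpenMP 4.5 or later):

    #define M 1000
    double D[M], X[M], Y[M];

    void doacross_example(void) {
        #pragma omp parallel for ordered(1)
        for (int i = 1; i < M; i++) {
            X[i] = Y[i] * 2.0;                      /* independent part of the iteration      */
            #pragma omp ordered depend(sink: i - 1) /* wait until iteration i-1 has signalled */
            D[i] = D[i - 1] + X[i];                 /* loop-carried dependence                */
            #pragma omp ordered depend(source)      /* signal that iteration i is done        */
        }
    }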
(Cobegin, Coend)
    All computations specified within the block could be executed in parallel.
    But parallel processes may be created with a slight time difference in real
      implementations.
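A rough C analogue of (Cobegin, Coend), again using OpenMP as an assumed notation, runs the statements of the block as parallel sections:

    void cobegin_example(void) {
        #pragma omp parallel sections
        {
            #pragma omp section
            { /* first parallel activity  */ }
            #pragma omp section
            { /* second parallel activity */ }
            #pragma omp section
            { /* third parallel activity  */ }
        }   /* Coend: implicit barrier, so all activities have finished here */
    }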
Program Flow Analysis: This phase reveals the program flow patterns in order to determine
data and control dependences in the source code.
     Scalar dependence analysis is extended below to structured data arrays or matrices.
     Depending on the machine structure, the granularities of parallelism to be exploited are
       quite different.
     Thus the flow analysis is conducted at different execution levels on different parallel
       computers.
     The flow analysis must also reveal code/data reuse and memory-access patterns.
Program Optimization:
     This refers to the transformation of user programs in order to explore the hardware
       capabilities as much as possible.
     In reality, most transformations are constrained by the machine architecture.
     The ultimate goal of program optimization is to maximize the speed of code execution.
     This involves the minimization of code length and of memory accesses and the
       exploitation of parallelism in programs.
     The optimization techniques include vectorization using pipelined hardware and
       parallelization using multiple processors simultaneously.
     Machine-dependent transformations are meant to achieve more efficient allocation of
       machine resources, such as processors, memory, registers, and functional units.
Iteration Space
     The n-dimensional discrete Cartesian space for n deep loops is called an iteration space.
 Each iteration is represented by its coordinates in the iteration space, and successive
        iterations of a loop nest are executed in lexicographic order of those coordinates. For
        example, in a 2-deep nest with both indices running from 1 to 2, the order is (1, 1),
        (1, 2), (2, 1), (2, 2); iteration (1, 2) precedes (2, 1) because the leftmost index in
        which they differ is smaller.
Dependence Equations
    Let α and β be vectors of n integer indices within the ranges of the upper and lower
      bounds of the n loops.
    There is a dependence from S1 to S2 if and only if there exist α and β such that α is
      lexicographically less than or equal to β and the following system of dependence
      equations is satisfied:
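Written in the standard form used in dependence testing (with f_k and g_k denoting the k-th subscript expressions of the two array references and m the number of subscript positions), the system is:

    f_k(α) = g_k(β),   for k = 1, 2, ..., m

That is, some iteration α of S1 and some iteration β of S2, with α lexicographically less than or equal to β, must refer to the same array element in every dimension.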
Example
The distance and direction vectors for the dependence between iterations along three dimensions
of the array A are (1, 0, -1) and (<, =, >), respectively.
Subscript Separability
    When testing multidimensional arrays, we say that a subscript position is separable if its
     indices do not occur in the other subscripts.
    If two different subscripts contain the same index, we say they are coupled.
    If all the subscripts are separable, we may compute the direction vector for each subscript
     independently and merge the direction vectors on a positional basis with full precision.
    The first subscript is separable because index i does not appear in the other dimensions,
     but the second and third are coupled because they both contain the index j.
    The ZIV subscripts are separable because they contain no indices.
Subscript Partitioning
    We need to classify all the subscripts in a pair of array references as separable or as part
     of some minimal coupled group.
    A coupled group is minimal if it cannot be partitioned into two nonempty subgroups with
     distinct sets of indices.
    Once a partition is achieved, each separable subscript and each coupled group has
     completely disjoint sets of indices.
    Each partition may then be tested in isolation and the resulting distance or direction
     vectors merged without any loss of precision.
Test Categories
    The ZIV Test
    The SIV Test
    The MIV Test
 A dependence exists if and only if d is an integer and |d| ≤ U - L, where d is the
   dependence distance and U and L are the loop upper and lower bounds. (A code sketch of
   this test is given below.)
 For dependences that do exist, the dependence direction is given by the sign of d: the
   direction is < if d > 0, = if d = 0, and > if d < 0.
    The strong SIV test is thus an exact test that can be implemented very efficiently in a few
     operations.
    A bounded iteration space is shown in Fig. 10.6a. The case of a strong SIV test is shown
     in Fig. 10.6b.
    This corresponds to the intersection of the dependence equation with the line i = i’.
    To determine whether dependences exist, we simply need to check that the resulting
     value i is within the loop bounds and is either an integer or has a noninteger part equal to
     1/2.
    Starting at the outermost loop nest and working inward, we replace each index in a loop
     upper bound with its maximum value (or minimal if it is a negative term).
 We do the opposite in the lower bound, replacing each index with its minimal value (or
   maximal if it is a negative term).
    We evaluate the resulting expressions to calculate the minimal and maximal values for
     the loop index and then repeat for the next inner loop.
    This algorithm returns the maximal range for each index, all that is needed for SIV tests.
 A special case of MIV subscripts, called RDIV (restricted double-index variable)
   subscripts, has the form (a1 i + c1, a2 j + c2). They are similar to SIV subscripts except
   that i and j are distinct indices.
    The development of a parallelizing compiler is limited by the difficulty of having to deal
     with many nonperfectly nested loops.
    The lack of dataflow information is often the ultimate limit on automatic compilation of
     parallel code.
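As a concrete rendering of the strong SIV test referred to above, the following C sketch follows the standard formulation for a subscript pair a*i + c1 (first reference) and a*i + c2 (second reference) with loop bounds L and U; the coefficient a is assumed nonzero, and all names are illustrative:

    #include <stdlib.h>

    typedef enum { NO_DEP, DIR_LT, DIR_EQ, DIR_GT } siv_result;

    siv_result strong_siv_test(int a, int c1, int c2, int L, int U) {
        if ((c1 - c2) % a != 0)        /* distance is not an integer: no dependence */
            return NO_DEP;
        int d = (c1 - c2) / a;         /* dependence distance                       */
        if (abs(d) > U - L)            /* distance exceeds the iteration range      */
            return NO_DEP;
        if (d > 0) return DIR_LT;      /* direction '<'                             */
        if (d < 0) return DIR_GT;      /* direction '>'                             */
        return DIR_EQ;                 /* direction '='                             */
    }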
      A basic environment provides a simple program tracing facility for debugging and
       performance monitoring or a graphic mechanism for specifying the task dependence
       graph.
    o The software and programming environments of the Cray Y-MP, Intel Paragon XP/S, and
        Connection Machine CM-5 are examined below.
Cray Y-MP Software
    o The Cray Y-MP ran with the Cray operating systems UNICOS or COS.
    o Two Fortran compilers, CFT77 and CFT, provided automatic vectorization, as did the C
        and Pascal compilers.
    o Large library routines, program management utilities, debugging aids, and a Cray
        assembler were included in the system software.
    o Communications and applications software were also provided by Cray, by a third party,
        or from the public domain.
    o UNICOS for Y-MP was written in C and was a time-sharing OS extension of UNIX.
        UNICOS supported optimizing, vectorizing, concurrentizing Fortran compilers and
        optimizing and vectorizing Pascal and C compilers.
    o COS was a multiprogramming, multiprocessing, and multitasking OS. It offered a batch
        environment to the user and supported interactive jobs and data transfers through the
        front-end system.
    o COS programs could run with a maximum of four processors and up to 16M words of
        memory on the Y-MP system.
    o Migration tools were available to ease conversion from COS to UNICOS.
    o CFT77 was a multipass, optimizing, transportable compiler. It carried out vectorization
        and scalar optimization of Fortran 77 programs.
lntel Paragon XP/S Software
    o The Intel Paragon XP/S system was an extension of the Intel iPSC/860 and Delta systems.
    o It was claimed to be a scalable, wormhole-routed, mesh-connected multicomputer using
        distributed memory.
    o The processing nodes used were 50-MHz i860 XP processors.
    o Paragon ran a distributed UNIX/OS based on OSF technology and was in conformance
        with POSIX, System V.3, and Berkeley 4.3BSD.
    o The languages supported included C, Fortran, Ada, C++, and data-parallel Fortran.
    o The integrated tool suite included FORGE and CAST parallelization tools, Intel
        ProSolver parallel equation solvers, and BLAS, NAG, SEGlib, and other math libraries.
    o The programming environment provided an interactive parallel debugger (IPD) with a
        hardware-aided performance visualization system (PVS).
CM-5 Software
    o The software layers of the Connection Machine systems are shown in Fig. 11.3.
    o The operating system used was CMOST, an enhanced version of UNIX/OS which
        supported time-sharing and batch processing.
    o The low-level languages and libraries matched the hardware instruction-set architectures.
    o CMOST provided not only standard UNIX services but also supported fast IPC
        capabilities and data-parallel and other parallel programming models.
    o It also extended the UNIX I/O environment to support parallel reads and writes and
        managed large files on data vaults.
    o The Prism programming environment was an integrated Motif-based graphical
        environment.
    o Users could develop, execute, debug, and analyze the performance of programs with
        Prism, which could operate on terminals or workstations running the X-Window system.
    o High-level languages supported on the CM-5 included CM Fortran, C*, and *Lisp.
   o CM Fortran was based on standard Fortran 77 with array processing extensions of
     standard Fortran 90. CM Fortran also offered several extensions beyond Fortran 90, such
     as a FORALL statement and some additional intrinsic functions.
   o C* was an extension of the C programming language which supported data-parallel
     programming. Similarly, *Lisp was an extension of the Common Lisp programming
     language for data-parallel programming.
   o Both could be used to structure parallel data and to compute, communicate, and transfer
     data in parallel.
   o Three synchronization methods are described below based on who updates the atom and
     whether sole access is granted before or after the atomic operations:
            (1) Lock Synchronization: In this method, the atom is updated by the requester
                process and sole access is granted before the atomic operation. For this reason,
                it is also called pre-synchronization. The method can be applied for shared
                read-only access.
            (2) Optimistic Synchronization: This method also updates the atom by the
                requester process. But sole access is granted after the atomic operation, as
                described below. It is also called post-synchronization.
                     o A process may secure sole access after first completing an atomic
                         operation on a local version of the atom and then executing another
                         atomic operation on the global version of the atom.
                     o The second atomic operation determines if a concurrent update of the
                         atom has been made since the first operation was begun. If a
                         concurrent update has taken place, the global version is not updated;
                         instead, the first atomic operation is aborted and restarted from the
                         new global version.
                          The name optimistic is due to the fact that the method expects that
                             there will be no concurrent access to the atom during the execution
                             of a single atomic operation.
          (3) Server Synchronization: This method updates the atom by the server process
              of the requesting process, as suggested by the name. Compared with lock
              synchronization and optimistic synchronization, server synchronization offers
              full service.
                 An atom has a unique update server. A process requesting an atomic
                      operation on the atom sends the request to the atom's update server.
                 The update server may be a specialized server processor (SP) associated
                      with the atom's memory module.
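A minimal sketch of optimistic (post-) synchronization, using C11 atomics as an assumed implementation vehicle: the requester computes a new version from a local copy, and the second atomic operation, a compare-and-swap, succeeds only if no concurrent update has happened in the meantime; otherwise the operation is restarted from the new global version.

    #include <stdatomic.h>

    _Atomic int shared_atom;                       /* the "atom" being updated (illustrative) */

    static int compute_new_value(int old_value) {  /* hypothetical local computation */
        return old_value + 1;
    }

    void optimistic_update(void) {
        int observed, desired;
        do {
            observed = atomic_load(&shared_atom);  /* first atomic operation: read a local version */
            desired  = compute_new_value(observed);
            /* second atomic operation: install the result only if no concurrent
               update occurred; on failure, 'observed' is refreshed and we retry. */
        } while (!atomic_compare_exchange_weak(&shared_atom, &observed, desired));
    }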
Multitasking Environments
   Multitasking exploits parallelism at several levels:
           - Functional units are pipelined or chained together.
           - Multiple functional units are used concurrently.
            - I/O and CPU activities are overlapped.
           - Multiple CPUs cooperate on a single program to achieve minimal execution time.
   In a multitasking environment, the tasks and data structures of a job must be properly
       partitioned to allow parallel execution without conflict.
   However, the availability of processors, the order of execution, and the completion of
       tasks are functions of the run-time conditions of the machine. Therefore, multitasking is
       nondeterministic with respect to time.
   On the other hand, tasks themselves must be deterministic with respect to results.
   To ensure successful multitasking, the user must precisely define and include the
       necessary communication and synchronization mechanisms and provide protection for
       shared data in critical sections.
                                             Figure 11.4a
      A main program forks a subroutine S1 and then forks out three additional subroutines S2,
       S3 and S4. Macrotasking is best suited to programs with larger, longer-running tasks.
     The execution of these four subroutine calls on a uniprocessor (processor 0) is done
       sequentially. Macrotasking can spread the calls across four processors.
    Note that each processor may have a different execution time. However, the overall
      execution time is reduced due to parallel processing.
Microtasking
      This corresponds to multitasking at the loop control level with finer granularity. Compiler
       directives are often used to declare parallel execution of independent or dependent
       iterations of a looping program construct.
       Figure 11.4b illustrates the spreading of every four iterations of a Do loop across four
        processors simultaneously through microtasking.
      When the iterations are independent of each other, microtasking is easier to implement.
       When dependence does exist between the iterations, the system must resolve the
       dependence before parallel execution can be continued.
      Interprocessor communication is needed to resolve the dependences.
      In addition to working efficiently on parts of programs where the granularity is small,
       microtasking works well when the number of processors available for the job is unknown
       or may vary during the program's execution.
                                               Figure 11.4b
Autotasking
    The autotasking feature automatically divides a program into discrete tasks for parallel
      execution on a multiprocessor. In the past, macrotasking was achieved by direct
      programmer intervention. Microtasking was aided by an interactive compiler.
    Autotasking demands much higher degrees of automation. Only some Cray
      multiprocessors, like the Cray Y-MP and C-90, were provided with autotasking software.
    Autotasking is based on the microtasking design and shares several advantages with
      microtasking: very low overhead synchronization cost, excellent dynamic performance
      independent of the number of processors available, both large and small granularity
      parallelism, and so on.
COMPUTER ARCHITECTURE
     a) We define computer architecture as the arrangement by which the various system
        building blocks (processors, functional units, main memory, cache, data paths, and
        so on) are interconnected and interoperated to achieve desired system performance.
     b) Processors make up the most important part of a computer system. Therefore, in
        addition to (a), processor design also constitutes a central and very important element
        of computer architecture.
      In Fig. 12.1 we see the schematic diagram of a simple computer system consisting of four
       processors, a large shared main memory, and a processor-memory bus.
      For the three subsystems, we assume the following performance figures:
                  i.    Each of the four processors can perform double precision floating point
                       operations at the rate of 500 million per second.
                 ii.   The shared main memory can read/write data at the aggregate rate of 1000
                       million 32-bit words per second.
                iii.   The processor-memory bus has the capability of transferring 500 million
                       32-bit words per second to/from main memory.
       This system exhibits a performance mismatch between the processors, the main memory,
        and the processor-memory bus.
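To see the mismatch concretely, assume (purely for illustration) that each floating point operation needs just one 64-bit operand, i.e. two 32-bit words, from main memory:

    4 processors x 500M operations/s x 2 words/operation = 4000M words/s of demand,
    versus 1000M words/s deliverable by the memory and only 500M words/s over the bus.

Even under this generous assumption, the bus can feed only about one-eighth of the processors' aggregate peak demand.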
      The system architect must pay careful attention to all such potential mismatches in
       system design. Otherwise, the sustained performance which the system can deliver can
       only equal the performance of the slowest part of the system—i.e. the bottleneck.
      In this particular case, a better investment for increased system performance could be
           a) Faster processor-memory bus, and
           b) Improved cache memory with each processor.
    At a given time, VLSI technology limits the left-hand side of the above equation, while
     the designer must select the two factors on the right. Aggressive exploitation of
     instruction-level parallelism, with multiple functional units and more complex control
     logic, increases the chip area and transistor count per processor core.
   Of course system design would involve issues which are more complex than these, but a
    basic design issue is seen here:
        o How should the designers divide available chip resources among processors and,
            within a single processor, among its various functional elements?
         o Within a processor, a set of instructions is in various stages of execution at a
             given time: within the pipeline stages, functional units, operation buffers,
             reservation stations, and so on. Recall that functional units themselves may also
             be internally pipelined.
         o To maintain the workflow of instructions within the processor, a superscalar
             processor makes use of branch prediction.
PROBLEM DEFINITION
    We must address the problem of detecting and exploiting the parallelism which is implicit
     within the instruction stream.
   We need a prototype instruction for our processor. We assume that the processor has a
    load-store type of instruction set, which means that all arithmetic and logical operations
    are carried out on operands which are present in programmable registers.
   Operands are transferred between main memory and registers by load and store
    instructions only.
    We assume a three-address instruction format, as seen on most RISC processors, so that a
     typical arithmetic or logical instruction names one destination and two source registers,
     i.e. it has the general form Rd <- Rs1 op Rs2.
     Data transfer instructions have only two operands—source and destination registers; load
      and store instructions to/from main memory specify one operand in the form of a memory
      address, using an available addressing mode.
    Effective address for load and store is calculated at the time of instruction execution.
    Conditional branch instructions need to be treated as a special category.
Review of basic dependence concepts
      Data Dependences: Assume that instruction Ik follows instruction Ij in the program. Data
       dependence between Ij and Ik means that both access a common operand.
       The four possible patterns of access to a common register by Ij and then Ik are read-read,
        write-read, read-write, and write-write. Of these, the first pattern (read-read) does not in
        fact create a dependence, since the two instructions can read the common value of the
        operand in any order.
       The other three patterns of operand access do create dependences amongst instructions.
        These are known as RAW (read after write), WAR (write after read), and WAW (write
        after write) dependences, respectively.
      Sometimes we need to show dependences between instructions using graphical notation.
      We shall use small circles to represent instructions, and double line arrows between two
       circles to denote dependences.
      The instruction at the head of the arrow is dependent on the instruction at the tail; if
       necessary, the type of dependence between instructions may be shown by appropriate
       notation next to the arrow.
      A missing arrow between two instructions will mean explicit absence of dependence.
       Single-line arrows will be used between instructions when we wish to denote program
        order without any implied dependence or absence of dependence.
      Figure 12.2 illustrates this notation.
      Often dependences are thus depicted in a basic block of instructions— i.e. a sequence of
       instructions with entry only at the first instruction, and exit only at the last instruction of
       the sequence.
           o Part (a) of Fig. 12.3 shows a basic block of six instructions, denoted I1 through I6
               in program order.
           o Part (b) of the figure shows a possible pattern of dependences as they may exist
               amongst these six instructions. For simplicity, we have not shown the type of each
               dependence, e.g. RAW(R3), etc.
Control Dependences
     In typical application programs, basic blocks tend to be small in length, since about 15%
       to 20% of the instructions in programs are branch and jump instructions, with indirect
       jumps and returns from procedure calls also included in the latter category.
    Because of typically small sizes of basic blocks in programs, the amount of instruction
      level parallelism which can be exploited in a single basic block is limited.
     Assume that instruction Ij is a conditional branch and that whether another instruction Ik
       executes or not depends on the outcome of the conditional branch instruction Ij.
     In such a case, we say that there is a control dependence of instruction Ik on Ij.
Resource Dependences
    This is possibly the simplest kind of dependence to understand, since it refers to a
      resource constraint causing dependence amongst instructions needing the resource.
    A resource dependence which results in a pipeline stall can arise for access to any
      processor resource— functional unit, data path, register bank, and so on.
    We can certainly say that such resource dependences will arise if hardware resources
      provided on the processor do not match the needs of the executing program.
MODEL OF A TYPICAL PROCESSOR
      In the process of translating a sequential source program into machine language, the
       compiler performs extensive syntactic and semantic analysis of the source program.
      There are several ways in which the compiler can contribute to the exploitation of
       implicit instruction level parallelism.
      One relatively simple technique which the compiler can employ is known as loop
       unrolling, by which independent instructions from multiple successive iterations of a loop
       can be made to execute in parallel.
       Unrolling means that the body of the loop is repeated n times for n successive values of
        the control variable, so that one iteration of the transformed loop performs the work of
        n iterations of the original loop.
      Example: Loop unrolling
      Consider the following body of a loop in a user program, where all the variables except
       the loop control variable i are assumed to be floating point:
      Now suppose that machine code is generated by the compiler as though the original
       program had been written as:
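Continuing the same illustrative stand-in, unrolling by four means the compiler generates code as if the loop had been written like this (a cleanup loop handles the case where n is not a multiple of four):

    void loop_unrolled(int n, float a[], float b[], float c[], float d[]) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* four independent instances per pass */
            c[i]     = a[i]     * b[i]     + d[i];
            c[i + 1] = a[i + 1] * b[i + 1] + d[i + 1];
            c[i + 2] = a[i + 2] * b[i + 2] + d[i + 2];
            c[i + 3] = a[i + 3] * b[i + 3] + d[i + 3];
        }
        for (; i < n; i++)                 /* cleanup for leftover iterations */
            c[i] = a[i] * b[i] + d[i];
    }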
      In the unrolled program fragment, the loop contains four independent instances of the
       original loop body—indeed this is the meaning of loop unrolling.
       Suppose machine code corresponding to the second program fragment is executing on a
        processor. Then clearly, if the processor has sufficient floating point arithmetic
        resources, instructions from the four loop iterations can be in progress in parallel on the
        various functional units.
      It is clear that code length of the machine language program increases as a result of loop
       unrolling; this increase may have an effect on the cache hit ratio.
      Also, more registers are needed to exploit the instruction level parallelism within the
       longer unrolled loop.
OPERAND FORWARDING
    Operand forwarding helps in reducing the impact of true data dependences in the
      instruction stream.
   Consider the following simple sequence of two instructions in a running program:
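The exact instruction pair is not reproduced here; in C-like form (register names used only for illustration) the sequence corresponds to:

    void raw_example(long r1, long r2) {
        long r3, r4;
        r3 = r1 + r2;    /* ADD: result written to destination register R3      */
        r4 = r3 >> 4;    /* shift right by four bits: reads R3, hence RAW on R3 */
        (void)r4;
    }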
       The result of the ADD instruction is stored in destination register R3, and then shifted
        right by four bits in the second instruction, with the shifted value being placed in R4.
        Thus, there is a simple RAW dependence between the two instructions—the output of the
        first is required as input operand of the second.
       In terms of our notation, this RAW dependence appears as shown in Fig. 12.6, in the
        form of a graph with two nodes and one edge.
REORDER BUFFER
 Since instructions execute in parallel on multiple functional units, the reorder buffer serves
  the function of bringing completed instructions back into an order which is consistent with
  program order.
 Entries in the reorder buffer are completed instructions, which are queued in program order.
  However, since instructions do not necessarily complete in program order, we also need a
  flag with each reorder buffer entry to indicate whether the instruction in that position has
  completed.
 Figure 12.9 shows a reorder buffer of size eight. Four fields are shown with each entry in the
  reorder buffer—instruction identifier, value computed, program-specified destination of the
  value computed, and a flag indicating whether the instruction has completed
 In Fig. 12.9, the head of queue of instructions is shown at the top, arbitrarily labeled as
  instr[i].
 This is the instruction which would be committed next, if it has completed execution. When
  this instruction commits, its result value is copied to its destination, and the instruction is
  then removed from the reorder buffer.
 The next instruction to be issued in the issue stage of the instruction pipeline then joins the
  reorder buffer at its tail.
 If the instruction at the head of the queue has not completed, and the reorder buffer is full,
  then further issue of instructions is held up, i.e. the pipeline stalls, because there is no free
  space in the reorder buffer for one more entry.
      We now take a brief look at how the use of reorder buffer addresses the various types of
       dependences in the program.
       (i) Data Dependence: A RAW dependence—i.e. true data dependence—will hold up the
       execution of the dependent instruction if the result value required as its input operand is
       not available. As suggested above, operand forwarding can be added to this scheme to
       speed up the supply of the needed input operand as soon as its value has been computed.
         (ii) Control Dependence: Suppose the instruction(s) in the reorder buffer belong to a
        branch path in the program which should not have been taken, i.e. there has been a
        mispredicted branch.
            o Clearly then the reorder buffer should be flushed along with other elements of the
               pipeline. Therefore the performance impact of control dependences in the running
               program is determined by the accuracy of branch prediction technique employed.
            o The reorder buffer plays no direct role in the handling of control dependences.
        (iii) Resource Dependences: If an instruction needs a functional unit to execute, but the
        unit is not free, then the instruction must wait for the unit to become free; clearly no
        technique can change that. In such cases, the processor designer can aim to
        achieve at least this: if a subsequent instruction needs to use another functional unit
        which is free, then the subsequent instruction can be executed out of order.
REGISTER RENAMING
    Amongst the instructions in various stages of execution within the processor, there would
     be occurrences of RAW, WAR and WAW dependences on programmable registers.
   As we have seen, RAW is true data dependence—since a value written by one instruction
     is used as an input operand by another. But a WAR or WAW dependence can be avoided
     if we have more registers to work with.
   We can simply remove such a dependence by getting the two instructions in question to
     use two different registers.
   But we must also assume that the instruction set architecture (ISA) of the processor is
     fixed—i.e. we cannot change it to allow access to a larger number of programmable
     registers.
   Therefore the only way to make a larger number of registers available to instructions
     under execution within the processor is to make the additional registers invisible to
     machine language instructions.
   Instructions under execution would use these additional registers, even if instructions
     making up the machine language program stored in memory cannot refer to them.
Working of register renaming for WAW dependence
    For example, let us say that an FADD instruction writes its result into register R5, and
      that a later FSUB instruction also writes its result into R5.
       Both these instructions are writing to register R5, thereby creating a WAW dependence,
        i.e. an output dependence, on register R5. Clearly, any subsequent instruction should read
        the value written into R5 by FSUB, and not the value written by FADD.
      Figure 12.10 shows this dependence in graphical notation.
      With additional registers available for use as these instructions execute, we have a simple
       technique to remove this output dependence.
      Let FSUB write its output value to a register other than R5, and let us call that other
       register X. Then the instructions which use the value generated by FSUB will refer to X,
       while the instructions which use the value generated by FADD will continue to refer to
       R5.
      Now since FADD and FSUB are writing to two different registers, the output dependence
       or WAW between them has been removed.
      When FSUB commits, then the value in R5 should be updated by the value in X—i.e. the
       value computed by FSUB. Then the physical register X, which is not a program visible
       register, can be freed up for use in another such situation.
       Note that here we have mapped, or renamed, R5 to X for the purpose of storing the result
        of FSUB, and thereby removed the WAW dependence from the instruction stream.
      The first FADD instruction is writing a value into R2, which the second FADD
       instruction is using, i.e. there is true data dependence between these two instructions.
       Let us assume that, when the first FADD instruction executes, R2 is mapped onto the
        program-invisible register Xm. A later FSUB instruction writes another value into R2,
        which is renamed to another program-invisible register Xn.
      Clearly, the second FADD (and other intervening instructions before FSUB) should see
       the value in R2 which is written by the first FADD and not the value written by FSUB.
       Figure 12.11 shows these two dependences in graphical notation.
      Instructions which use the value written by FSUB refer to Xn, while instructions which
       use the value written by the first FADD refer to Xm.
      The WAR dependence between the second FADD and FSUB is removed; but the RAW
       dependence between the two FADD instructions is respected via Xm.
       When the first FADD commits, the value in Xm is transferred to R2 and the
        program-invisible register Xm is freed up; likewise, later, when FSUB commits, the value
        in Xn is transferred to R2 and the program-invisible register Xn is freed up.
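A minimal sketch of the renaming mechanism described above, with architectural (program-visible) registers mapped onto a larger pool of program-invisible physical registers; commit and rollback handling is omitted, and all names are illustrative:

    #define NUM_ARCH 32
    #define NUM_PHYS 64

    static int map[NUM_ARCH];          /* architectural -> physical mapping */
    static int free_list[NUM_PHYS];    /* pool of unused physical registers */
    static int nfree;

    void rename_init(void) {
        for (int r = 0; r < NUM_ARCH; r++) map[r] = r;   /* identity map to start */
        nfree = 0;
        for (int p = NUM_ARCH; p < NUM_PHYS; p++) free_list[nfree++] = p;
    }

    /* Every new writer of an architectural register gets a fresh physical
       register, so WAW and WAR hazards on that register disappear.        */
    int rename_dest(int arch_reg) {
        int phys = free_list[--nfree];
        map[arch_reg] = phys;
        return phys;
    }

    /* Source operands read whichever physical register currently holds the
       most recent value of the architectural register; RAW is preserved.   */
    int rename_src(int arch_reg) {
        return map[arch_reg];
    }

In the R2 example above, the first FADD's destination would be renamed to Xm, FSUB's destination to Xn, and the second FADD's source lookup (performed before FSUB is renamed, in program order) would still return Xm.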
TOMASULO’S ALGORITHM
    The processor of the IBM System/360 Model 91 was designed with multiple floating point
     units, and it made use of an innovative algorithm for the efficient use of these units.
    The algorithm was based on operand forwarding over a common data bus, with tags to
     identify the sources of data values sent over the bus.
    The algorithm has since become known as Tomasulo's algorithm, after its chief designer.
     What we now understand as register renaming was also an implicit part of the original
     algorithm.
   Recall that, for register renaming, we need a set of program invisible registers to which
    programmable registers are re-mapped.
    Tomasulo's algorithm requires these program-invisible registers to be provided in the
     reservation stations of the functional units.
   Figure 12.12 shows such a functional unit connected to the common data bus, with three
    reservation stations provided on it.
      When the needed operand value or values are available in a reservation station, the
       functional unit can initiate the required operation in the next clock cycle.
       At the time of instruction issue, the reservation station is filled out with the operation
        code (op).
    If an operand value is available, for example in a programmable register, it is transferred
       to the corresponding source operand field in the reservation station.
    However, if the operand value is not available at the time of issue, the corresponding
       source tag (t1 and/or t2) is copied into the reservation station.
    The source tag identifies the source of the required operand. As soon as the required
       operand value is available at its source—which would be typically the output of a
       functional unit—the data value is forwarded over the common data bus, along with the
       source tag.
    This value is copied into all the reservation station operand slots which have the matching
       tag.
    Thus operand forwarding is achieved here with the use of tags. All the destinations which
       require a data value receive it in the same clock cycle over the common data bus, by
       matching their stored operand tags with the source tag sent out over the bus.
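A compact sketch of this tag-matching broadcast, in the spirit of the description above (field names and sizes are assumptions of this sketch, not the original hardware's):

    #include <stdbool.h>

    typedef struct {
        bool busy;               /* entry in use?                         */
        int  op;                 /* operation code                        */
        bool ready1, ready2;     /* operand values available?             */
        int  val1, val2;         /* operand values, once available        */
        int  tag1, tag2;         /* source tags while waiting for a value */
    } rs_entry;

    #define NUM_RS 8
    rs_entry rs[NUM_RS];

    /* Broadcast a computed result over the common data bus: every waiting
       reservation-station slot whose stored tag matches captures the value
       in the same cycle.                                                   */
    void cdb_broadcast(int source_tag, int value) {
        for (int i = 0; i < NUM_RS; i++) {
            if (!rs[i].busy) continue;
            if (!rs[i].ready1 && rs[i].tag1 == source_tag) {
                rs[i].val1 = value;  rs[i].ready1 = true;
            }
            if (!rs[i].ready2 && rs[i].tag2 == source_tag) {
                rs[i].val2 = value;  rs[i].ready2 = true;
            }
        }
        /* a register-file entry tagged with source_tag would be updated here as well */
    }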
Example Tomasulo's algorithm and RAW dependence
 Assume that instruction I1 is to write its result into R4, and that two subsequent instructions
   I2 and I3 are to read—i.e. make use of—that result value.
 Thus instructions I2 and I3 are truly data dependent (RAW dependence) on instruction I1.
   See Fig. 12.13.
       Assume that the value in R4 is not available when I2 and I3 are issued; the reason could
        be, for example, that one of the operands needed for I1 is itself not available. Thus we
        assume that I1 has not even started executing when I2 and I3 are issued.
      When I2 and I3 are issued, they are parked in the reservation stations of the appropriate
       functional units.
       Since the required result value from I1 is not available, these reservation station entries of
        I2 and I3 get the source tag corresponding to the output of I1, i.e. the output of the
        functional unit which is performing the operation of I1.
      When the result of I1 becomes available at its functional unit, it is sent over the common
       data bus along with the tag value of its source—i.e. output of functional unit.
      At this point, programmable register R4 as well as the reservation stations assigned to I2
       and I3 have the matching source tag—since they are waiting for the result value, which is
       being computed by I1.
      When the tag sent over the common data bus matches the tag in any destination, the data
       value on the bus is copied from the bus into the destination.
      The copy occurs at the same time into all the destinations which require that data value.
        Thus R4 as well as the two reservation stations holding I2 and I3 receive the required
        data value, which has been computed by I1, at the same time over the common data bus.
       Thus, through the use of source tags and the common data bus, in one clock cycle, three
        destinations receive the value produced by I1: programmable register R4, and the
        operand registers in the reservation stations assigned to I2 and I3.
BRANCH PREDICTION
   A basic branch prediction technique uses a so-called two bit predictor. A two-bit counter
    is maintained for every conditional branch instruction in the program.
   The two-bit counter has four possible states; these four states and the possible transitions
    between these states are shown in Fig. 12.15.
   When the counter state is 0 or 1, the respective branch is predicted as taken; when the
    counter state is 2 or 3, the branch is predicted as not taken.
   When the conditional branch instruction is executed and the actual branch outcome is
    known, the state of the respective two-bit counter is changed as shown in the figure using
    solid and broken line arrows.
   When two successive predictions come out wrong, the prediction is changed from branch
    taken to branch not taken, and vice versa.
   In Fig. 12.15, state transitions made on mis-predictions are shown using broken line
    arrows, while solid line arrows show state transitions made on predictions which come
    out right.
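One common realization of the two-bit predictor described above, using the text's convention that counter states 0 and 1 predict "taken" and states 2 and 3 predict "not taken":

    typedef struct { unsigned state; } two_bit_predictor;   /* state in 0..3 */

    int predict_taken(const two_bit_predictor *p) {
        return p->state <= 1;                 /* 0 or 1: predict taken */
    }

    void update_predictor(two_bit_predictor *p, int branch_was_taken) {
        if (branch_was_taken) {
            if (p->state > 0) p->state--;     /* move toward the "taken" end (state 0)     */
        } else {
            if (p->state < 3) p->state++;     /* move toward the "not taken" end (state 3) */
        }
    }

Two successive mispredictions move the counter across the midpoint, which is exactly the point at which the predicted direction flips.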