Module - 4 - Parallel Processing
4. Parallelism
4.1 Introduction
Parallel Processing
✓ Parallel processing can be described as a class of techniques that enables a
   system to carry out data-processing tasks simultaneously in order to increase the
   computational speed of a computer system.
✓ A parallel processing system can carry out simultaneous data-processing to
   achieve faster execution time.
✓ For instance, while an instruction is being processed in the ALU component of the
   CPU, the next instruction can be read from memory.
✓ The primary purpose of parallel processing is to enhance the computer's processing
   capability and increase its throughput.
✓ A parallel processing system can be achieved by having a multiplicity of
   functional units that perform identical or different operations simultaneously.
✓ The data can be distributed among the multiple functional units.
✓ The following diagram shows one possible way of separating the execution unit
   into eight functional units operating in parallel.
✓ The operation performed in each functional unit is indicated in each block of the
   diagram:
✓   The adder and integer multiplier perform arithmetic operations on integer
    numbers.
✓   The floating-point operations are separated into three circuits operating in parallel.
✓   The logic, shift, and increment operations can be performed concurrently on
    different data.
✓   All units are independent of each other, so one number can be shifted while
    another number is being incremented.
✓ Parallel computers can be roughly classified according to the level at which the
   hardware supports parallelism, with multi-core and multi-processor computers
   having multiple processing elements within a single machine.
✓ In some cases parallelism is transparent to the programmer, such as in bit-level or
   instruction-level parallelism.
✓ But explicitly parallel algorithms, particularly those that use concurrency, are more
   difficult to write than sequential ones, because concurrency introduces several new
   classes of potential software bugs, of which race conditions are the most common.
✓ Communication and synchronization between the different subtasks are typically
   some of the greatest obstacles to getting optimal parallel program performance.
Types of Parallelism:
  1. Bit-level parallelism: It is the form of parallel computing based on increasing
     the processor's word size. It reduces the number of instructions that the system
     must execute in order to perform a task on large-sized data.
     Example: Consider a scenario where an 8-bit processor must compute the sum of
     two 16-bit integers. It must first sum up the 8 lower-order bits, then add the 8
     higher-order bits, thus requiring two instructions to perform the operation. A 16-
     bit processor can perform the operation with just one instruction (see the first
     sketch after this list).
  2. Instruction-level parallelism: Ordinarily, a processor issues at most one
     instruction in each clock cycle. Independent instructions can be re-ordered and
     grouped, and then executed concurrently without affecting the result of the
     program. This is called instruction-level parallelism.
  3. Task parallelism: Task parallelism employs the decomposition of a task into
     subtasks and then allocates each of the subtasks for execution. The processors
     perform execution of the subtasks concurrently (see the second sketch after
     this list).
   4. Data-level parallelism (DLP): Instructions from a single stream operate
      concurrently on several data items (see the third sketch after this list). It is
      limited by non-regular data manipulation patterns and by memory bandwidth.
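The following is a minimal sketch (in Python, purely for illustration) of the bit-level
parallelism example in item 1: a hypothetical 8-bit machine adds two 16-bit integers in
two steps, propagating the carry from the low byte into the high byte.

    # Sketch: adding two 16-bit integers on a hypothetical 8-bit machine.
    # The operands are split into low and high bytes; the low bytes are added
    # first, and the carry out is fed into the high-byte addition.
    def add16_on_8bit(a: int, b: int) -> int:
        a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
        b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF
        lo_sum = a_lo + b_lo                    # first 8-bit add
        carry = lo_sum >> 8                     # carry out of the low byte
        hi_sum = (a_hi + b_hi + carry) & 0xFF   # second 8-bit add, with carry
        return (hi_sum << 8) | (lo_sum & 0xFF)

    assert add16_on_8bit(0x12F0, 0x0011) == 0x1301   # carry ripples into the high byte

A 16-bit machine would perform the same addition as a single instruction.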
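Next, a minimal sketch of task parallelism (item 3), using Python's standard library only
as an illustration: a task is decomposed into subtasks, and each subtask is handed to a
separate worker for concurrent execution.

    from concurrent.futures import ProcessPoolExecutor

    def subtask(chunk):
        # each worker executes one subtask on its own slice of the data
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(100_000))
        chunks = [data[i::4] for i in range(4)]        # decompose into 4 subtasks
        with ProcessPoolExecutor(max_workers=4) as ex:
            partials = list(ex.map(subtask, chunks))   # subtasks run concurrently
        print(sum(partials))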
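Finally, a minimal sketch of data-level parallelism (item 4): a single logical operation
applied element-wise across many data items. NumPy is assumed here purely for
illustration; internally it may map such operations onto SIMD hardware instructions.

    import numpy as np

    a = np.arange(1_000_000, dtype=np.float64)
    b = np.ones(1_000_000, dtype=np.float64)
    c = a + b    # one logical operation applied to a million element pairs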
Architectural Trends
✓ When multiple operations are executed in parallel, the number of cycles needed to
   execute the program is reduced.
✓ However, resources are needed to support each of the concurrent activities.
✓ Resources are also needed to allocate local storage.
✓ The best performance is achieved by an intermediate action plan that uses
   resources to utilize a degree of parallelism and a degree of locality.
✓ Generally, the history of computer architecture has been divided into four
   generations based on the following basic technologies:
      •   Vacuum tubes
      •   Transistors
       •   Integrated circuits
      •   VLSI
✓ Till 1985, the period was dominated by the growth in bit-level parallelism.
✓ 4-bit microprocessors were followed by 8-bit, 16-bit, and so on.
✓ To reduce the number of cycles needed to perform a full 32-bit operation, the
   width of the datapath was doubled. Later on, 64-bit operations were introduced.
✓ The growth in instruction-level parallelism dominated the mid-80s to mid-90s.
✓ The RISC approach showed that it was simple to pipeline the steps of instruction
   processing so that, on average, an instruction is executed in almost every cycle.
✓ Growth in compiler technology has made instruction pipelines more productive.
✓ In the mid-80s, microprocessor-based computers consisted of
        •   An integer processing unit
        •   A floating-point unit
        •   A cache controller
        •   SRAMs for the cache data
        •   Tag storage
✓ As chip capacity increased, all these components were merged into a single chip.
✓ Thus, a single chip consisted of separate hardware for integer arithmetic, floating-
   point operations, memory operations and branch operations.
✓ Other than pipelining individual instructions, the processor fetches multiple
   instructions at a time and sends them in parallel to different functional units
   whenever possible. This type of instruction-level parallelism is called superscalar
   execution.
FLYNN'S CLASSIFICATION
✓ Flynn's taxonomy is a classification of parallel computer architectures based on
   the number of concurrent instruction streams (single or multiple) and data
   streams (single or multiple) available in the architecture.
✓ The four categories in Flynn's taxonomy are the following:
                   1. (SISD) single instruction, single data
                   2. (SIMD) single instruction, multiple data
                   3. (MISD) multiple instruction, single data
                   4. (MIMD) multiple instruction, multiple data
✓ Instruction stream: the sequence of instructions as executed by the machine.
✓ Data stream: the sequence of data, including input and partial or temporary
   results, called for by the instruction stream.
✓ Instructions are decoded by the control unit, and the control unit then sends the
   instructions to the processing units for execution.
✓ The data stream flows between the processors and memory bidirectionally.
SISD
An SISD computing system is a uniprocessor machine which is capable of executing a
single instruction, operating on a single data stream.
SIMD
   •   An SIMD system is a multiprocessor machine capable of executing the same
       instruction on all the CPUs but operating on different data streams.
✓ Machines based on an SIMD model are well suited to scientific computing since
   they involve lots of vector and matrix operations.
✓ The organized data elements of vectors can be divided into multiple sets (N sets
   for an N-PE system) so that the information can be passed to all the processing
   elements (PEs); each PE then processes one data set.
✓ A dominant representative of SIMD systems is Cray's vector-processing machine.
MISD
✓ An MISD computing system is a multiprocessor machine capable of executing
   different instructions on different PEs, with all of them operating on the same
   data set.
✓ The system performs different operations on the same data set. Machines built
   using the MISD model are not useful in most applications; a few machines have
   been built, but none of them are available commercially.
MIMD
✓ An MIMD system is a multiprocessor machine which is capable of executing
   multiple instructions on multiple data sets.
✓ Each PE in the MIMD model has separate instruction and data streams; therefore
   machines built using this model are capable of running any kind of application.
✓ Unlike SIMD and MISD machines, PEs in MIMD machines work
   asynchronously.
✓ MIMD machines are broadly categorized into
        • shared-memory MIMD and
        • distributed-memory MIMD
         based on the way PEs are coupled to the main memory.
In the shared-memory MIMD model (tightly coupled multiprocessor systems), all the
PEs are connected to a single global memory and they all have access to it. The
communication between PEs in this model takes place through the shared memory;
modification of the data stored in the global memory by one PE is visible to all other PEs.
Dominant representative shared-memory MIMD systems are Silicon Graphics machines
and Sun/IBM's SMP (Symmetric Multi-Processing).
VECTOR ARCHITECTURES
✓ A multithreaded CPU is not a parallel architecture, strictly speaking; multithreading
   is obtained through a single CPU, but it allows a programmer to design and develop
   applications as a set of programs that can virtually execute in parallel: namely,
   threads.
✓ Multithreading is a solution to avoid wasting clock cycles while missing data is
   fetched: the CPU manages several peer threads concurrently, and if one thread
   gets blocked, the CPU can execute instructions of another thread, thus keeping the
   functional units busy.
✓ Each thread must have a private program counter and a set of private registers,
   separate from other threads.
✓ In a traditional scalar processor, the basic data type is an n-bit word.
✓ The architecture often exposes a register file of words, and the instruction set is
   composed of instructions that operate on individual words.
✓ In a vector architecture, there is support for a vector data type, where a vector is a
   collection of VL n-bit words (VL is the vector length).
✓ There may also be a vector register file, which was a key innovation of the Cray
   architecture.
✓ Previously, vector machines operated on vectors stored in main memory.
✓ Figures 1 and 2 illustrate the difference between vector and scalar data types, and
   the operations that can be performed on them.
✓ Vector load/store instructions provide the ability to do strided and scatter/gather
   memory accesses, which take data elements distributed throughout memory and
   pack them into sequential vectors/streams placed in vector/stream registers.
✓ This promotes data locality.
✓ It results in less data pollution, since only useful data is loaded from the memory
   system.
✓ It provides latency tolerance because there can be many simultaneous outstanding
   memory accesses.
✓ Vector instructions such as VLD and VST provide this capability.
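A minimal sketch of the strided and gather access patterns just described, using NumPy
indexing purely to mimic what vector load instructions do in hardware (the array
contents are illustrative):

    import numpy as np

    mem = np.arange(32)                  # stand-in for a region of memory

    strided = mem[0:32:4]                # strided load: every 4th word
    indices = np.array([3, 17, 9, 25])
    gathered = mem[indices]              # gather: pack scattered words into a dense vector

    print(strided)    # [ 0  4  8 12 16 20 24 28]
    print(gathered)   # [ 3 17  9 25]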
HARDWARE MULTITHREADING
Multithreading
   •   A mechanism by which the instruction stream is divided into several smaller
       streams (threads) that can be executed in parallel is called multithreading.
Hardware Multithreading
   • Increasing the utilization of a processor by switching to another thread when one
       thread is stalled is known as hardware multithreading.
Thread
   • A thread includes the program counter, the register state, and the
       stack. It is a lightweight process; whereas threads commonly share a single
       address space, processes don't.
Thread Switch
   • The act of switching processor control from one thread to another within
       the same process. It is much less costly than a process switch.
 Process
    • A process includes one or more threads, the address space, and the
        operating system state. Hence, a process switch usually invokes the operating
        system, but not a thread switch.
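A small demonstration of the thread/process distinction above (Python is assumed here
only for illustration): two threads in one process share a single address space, so both
see and update the same counter variable; separate processes would each get a private
copy.

    import threading

    counter = 0
    lock = threading.Lock()

    def work():
        global counter
        for _ in range(100_000):
            with lock:            # shared data in one address space needs synchronization
                counter += 1

    threads = [threading.Thread(target=work) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                # 200000: both threads updated the same shared memory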
Types of Multi-threading
        1. Fine-grained multithreading
        2. Coarse-grained multithreading
        3. Simultaneous multithreading
 Coarse-grained Multithreading
    A version of hardware multithreading that implies switching between threads only
        after significant events, such as a last-level cache miss.
    • This change relieves the need to have thread switching be extremely fast and
        is much less likely to slow down the execution of an individual thread, since
        instructions from other threads will only be issued when a thread encounters
        a costly stall.
 Advantage
    •   Thread switching does not need to be very fast.
    •   It is much less likely to slow down the execution of an individual thread.
 Disadvantage
    • It is hard to overcome throughput losses from shorter stalls, due to pipeline
        start-up costs.
    • Since the CPU issues instructions from one thread, when a stall occurs the
        pipeline must be emptied.
    •   The new thread must fill the pipeline before instructions can complete.
    • Due to this start-up overhead, coarse-grained multithreading is much more
        useful for reducing the penalty of high-cost stalls, where pipeline refill is
        negligible compared to the stall time.
    Fine-grained Multithreading
    • A version of hardware multithreading that implies switching between threads
    after every instruction, resulting in interleaved execution of multiple threads. It
    switches from one thread to another at each clock cycle.
    • This interleaving is often done in a round-robin fashion, skipping any threads
    that are stalled at that clock cycle.
   To make fine-grained multithreading practical, the processor must be able to switch
       threads on every clock cycle.
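The following toy simulation (illustrative Python, not hardware) shows the round-robin
policy just described: the processor switches thread every cycle and skips a thread that
is stalled in that cycle.

    threads = {
        "T0": ["i0", "i1", "i2"],
        "T1": ["j0", "j1"],
        "T2": ["k0", "k1", "k2"],
    }
    stalled = {("T1", 1)}           # assumption: thread T1 is stalled during cycle 1

    order = list(threads)
    cycle = 0
    while any(threads.values()):
        tid = order[cycle % len(order)]
        if threads[tid] and (tid, cycle) not in stalled:
            print(f"cycle {cycle}: issue {threads[tid].pop(0)} from {tid}")
        else:
            print(f"cycle {cycle}: {tid} stalled or finished (slot wasted)")
        cycle += 1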
Advantage
   •    Vertical waste is eliminated.
   •    Pipeline hazards cannot arise.
   •    Zero switching overhead.
   • It can hide latency within a thread, i.e., it can hide the throughput losses
       that arise from both short and long stalls.
   •    Instructions from other threads can be executed when one thread stalls.
   •    High execution efficiency.
   •    Potentially less complex than alternative high-performance processors.
Disadvantage
   •    Clock cycles are wasted if a thread has few operations to execute.
   •    It needs a lot of threads to execute.
   •    It is more expensive than coarse-grained multithreading.
   • It slows down the execution of the individual threads, since a thread that is
       ready to execute without stalls will be delayed by instructions from other
       threads.
Simultaneous multithreading (SMT)
   • It is a variation on hardware multithreading that uses the resources of a
       multiple-issue, dynamically scheduled pipelined processor to exploit thread-
       level parallelism at the same time it exploits instruction-level parallelism.
   • The key insight that motivates SMT is that multiple-issue processors often
       have more functional unit parallelism available than most single threads can
       effectively use.
    Since SMT relies on the existing dynamic mechanisms, it does not switch resources
        every cycle.
    • Instead, SMT is always executing instructions from multiple threads, leaving
        it to the hardware to associate instruction slots and renamed registers with
        their proper threads.
Advantage
    • It can boost utilization by dynamically scheduling functional
        units among multiple threads.
    •   It increases hardware design flexibility.
    • It produces better performance and adds resources in a fine-grained manner.
Disadvantage
        It cannot improve performance if any of the shared resources are the limiting
        bottlenecks for the performance.
Uniform Memory Access (UMA) multiprocessors
    Physical memory is uniformly shared by all processors, with equal access time to
         all words.
     •   Processors may have local cache memories. Peripherals are also shared in
     some fashion.
    •    UMA architecture models are of two types,
                  Symmetric:
                       • All processors have equal access to all peripheral
                           devices. All processors are identical.
                  Asymmetric:
                       • One processor (master) executes the operating system; other
                          processors may be of different types and may be dedicated to
                          special tasks.
Non Uniform Memory Access (NUMA) multiprocessors
• In shared-memory multiprocessor systems, local memories can be connected
    with every processor. The collection of all local memories forms the global
    memory being shared.
• In this way, global memory is distributed to all the processors. In this case, the
    access to a local memory is uniform for its corresponding processor, as it is
    attached to that local memory.
• But if one reference is to the local memory of some other remote processor,
    then the access is not uniform.
• It depends on the location of the memory. Thus, not all memory words are
    accessed uniformly. All local memories form a global address space accessible
    by all processors.
• Programming NUMAs is harder, but NUMAs can scale to larger sizes and
    have lower latency to local memory.
• Memory is common to all the processors. Processors easily communicate by
    means of shared variables.
• These systems differ in how the memory and peripheral resources are
    shared or distributed.
•   The access time varies with the location of the memory word.
   Superscalar Processors
    The first superscalar processors appeared around 1987. A superscalar machine is designed to improve the performance of the scalar
    processor. In most applications, most of the operations are on scalar quantities. The superscalar approach
    produces high-performance general-purpose processors.
    The main principle of the superscalar approach is that it executes instructions independently in different
    pipelines. As we already know, instruction pipelining leads to parallel processing, thereby speeding up
    the processing of instructions. In a superscalar processor, multiple such pipelines are introduced for different
    operations, which further improves parallel processing.
   There are multiple functional units each of which is implemented as a pipeline. Each pipeline consists of
   multiple stages to handle multiple instructions at a time which support parallel execution of instructions.
   It increases the throughput because the CPU can execute multiple instructions per clock cycle. Thus,
   superscalar processors are much faster than scalar processors.
   A scalar processor works on one or two data items, while the vector processor works with multiple data items.
   A superscalar processor is a combination of both. Each instruction processes one data item, but there are
   multiple execution units within each CPU thus multiple instructions can be processing separate data items
   concurrently.
    While a superscalar CPU is typically also pipelined, pipelining and superscalar execution are two different
    performance enhancement techniques: it is possible to have a non-pipelined superscalar CPU or a pipelined
    non-superscalar CPU. The superscalar technique is associated with the following characteristics:
      1. Instructions are issued from a sequential instruction stream.
      2. CPU must dynamically check for data dependencies.
      3. Should accept multiple instructions per clock cycle.
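Characteristic 2 above, the dynamic dependency check, can be sketched as follows
(illustrative Python; instructions are modeled simply as a destination register plus a set
of source registers):

    def independent(i1, i2):
        """True if the two instructions may issue in the same cycle."""
        d1, s1 = i1
        d2, s2 = i2
        raw = (d1 in s2) or (d2 in s1)   # one reads a register the other writes
        waw = (d1 == d2)                 # both write the same register
        return not (raw or waw)

    add = ("r3", {"r1", "r2"})           # r3 = r1 + r2
    mul = ("r5", {"r4"})                 # r5 = r4 * r4
    use = ("r6", {"r3", "r5"})           # r6 = r3 + r5

    print(independent(add, mul))         # True : no shared registers, may dual-issue
    print(independent(add, use))         # False: 'use' reads r3, which 'add' writes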
Superscalar Architecture
       •     Superscalar Architecture (SSA) describes a microprocessor design that executes more than one instruction
             at a time during a single clock cycle.
      •     The design is sometimes called “Second Generation RISC”. Another term used to describe superscalar
            processors is multiple instruction issue processors.
      •     In a SSA design, the processor or the instruction compiler is able to determine whether an instruction can be
            carried out independently of other sequential instructions, or whether it has a dependency on another
            instruction and must be executed sequentially.
•   In a SSA, several scalar instructions can be initiated simultaneously and executed independently.
•   It is the product of a long series of innovations aimed at producing ever-faster microprocessors.
•   Includes all features of pipelining but, in addition, there can be several instructions executing simultaneously
    in the same pipeline stage.
   •   SSA introduces a new level of parallelism, called instruction-level parallelism.
SimpleScalar Architecture
   •   SimpleScalar is an open source computer architecture simulator which is written using ‘C’ programming
       language.
   •   A set of tools that model a virtual computer system with CPU, Cache and Memory Hierarchy.
   •   Using the tool, users can model applications that simulate programs running on a range of modern
       processors and systems.
    •   The tool set includes sample simulators ranging from a fast functional simulator to a detailed out-of-order
        processor simulator.
Scalar to Superscalar
   •   The simplest processors are scalar processors. Each instruction executed by a scalar processor typically
       manipulates one or two data items at a time.
    •   In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in
        parallel.
    •   Therefore a superscalar processor can be viewed as having multiple parallel pipelines, each of which is
        processing instructions simultaneously from a single instruction thread.
Implement Superscalar
   •   A SSA processor fetches multiple instructions at a time, and attempts to find nearby instructions that are
       independent of each other and therefore can be executed in parallel.
   •   Based on the dependency analysis, the processor may issue and execute instructions in an order that differs
       from that of the original machine code.
   •   The processor may eliminate some unnecessary dependencies by the use of additional registers.
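The last point, eliminating unnecessary dependencies through additional registers, is the
idea behind register renaming. A toy sketch follows (illustrative Python; the register
names are assumptions): the second write to r1 is mapped to a fresh physical register,
removing the false dependency so the two computations can proceed independently.

    free = ["p1", "p2", "p3", "p4"]      # pool of spare physical registers
    rename = {}                          # architectural -> physical mapping

    def write(reg):
        rename[reg] = free.pop(0)        # every new write gets a fresh register
        return rename[reg]

    def read(reg):
        return rename.get(reg, reg)

    # r1 = r2 + r3 ; r4 = r1 * 2 ; r1 = r5 - r6   (the reuse of r1 is a false dependency)
    print(write("r1"), "=", read("r2"), "+", read("r3"))   # p1 = r2 + r3
    print(write("r4"), "=", read("r1"), "* 2")             # p2 = p1 * 2
    print(write("r1"), "=", read("r5"), "-", read("r6"))   # p3 = r5 - r6 (independent of p1)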
Superpipelining
   •  Superpipelining is based on dividing the stages into several sub-stages, and thus increasing the number of
      instructions which are handled by the pipeline at the same time.
   • For example, by dividing each stage into two sub-stages, a pipeline can perform at twice the speed in the
     ideal situation:
     ➢ No duplication of hardware is needed for these sub-stages (whereas the superscalar approach duplicates
        hardware for its parallel pipelines).
     ➢ Each sub-stage handles a task that requires less than half a clock cycle.
Superscalar Issues to Consider
   ➢ Tasks can be divided into the following:
         o Parallel decoding
         o Superscalar instruction issue
         o Parallel instruction execution
         o Preserving the sequential consistency of execution
         o Preserving the sequential consistency of exception processing
   ➢ Parallel decoding – a more complex task than decoding in scalar processors.
   ➢ Parallel instruction execution – while instructions are executed in parallel, instructions are usually
     completed out of order with respect to a sequential operating procedure.
   ➢ Superscalar instruction issue – a higher issue rate gives rise to higher processor performance, but amplifies
     the restrictive effects of control and data dependencies on the processor performance.
  Limitations of Superscalar
     ➢ Instruction-fetch inefficiencies caused by both branch delays and instruction misalignment.
      ➢ It is not worthwhile to explore highly-concurrent execution hardware; rather, it is more appropriate to
        explore economical execution hardware.
     ➢ degree of intrinsic parallelism in the instruction stream (instructions requiring the same computational
       resources from the CPU).
     ➢ complexity and time cost of the dispatcher and associated dependency checking logic.
     ➢ branch instruction processing.
VLIW PROCESSORS
     ➢ Very long instruction word or VLIW refers to a processor architecture designed to take advantage of
       instruction level parallelism.
             o An instruction of a VLIW processor consists of multiple independent operations grouped together.
            o There are Multiple Independent Functional Units in VLIW processor architecture.
            o Each operation in the instruction is aligned to a functional unit.
            o All functional units share the use of a common large register file.
     ➢ This type of processor architecture is intended to allow higher performance without the inherent complexity of
       some other approaches.
  Different Approaches
  Other approaches to improving performance in processor architectures :
     •   Pipelining: Breaking up instructions into sub-steps so that instructions can be executed partially at the same
         time.
     •   Superscalar architectures: Dispatching individual instructions to be executed completely independently in
         different parts of the processor.
      •   Out-of-order execution: Executing instructions in an order different from the program order.
   Consider three operations, where operation 3 depends on the results of operations 1 and 2, so it cannot be
   calculated until both of them are completed. However, operations 1 and 2 do not depend on any other
   operation, so they can be calculated simultaneously. If we assume that each operation can be completed in
   one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP
   of 3/2.
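The three operations can be written out as follows (the variable names are assumptions
for illustration, not from an original figure):

    a, b, c, d = 1, 2, 3, 4
    e = a + b      # operation 1: independent
    f = c + d      # operation 2: independent of operation 1 (same time unit)
    m = e * f      # operation 3: needs both results, so it runs in the next time unit
    # 3 operations completed in 2 units of time -> ILP = 3/2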
  VLIW Compiler
     •   Compiler is responsible for static scheduling of instructions in VLIW processor.
     •   Compiler finds out which operations can be executed in parallel in the program.
     •   It groups together these operations in single instruction which is the very large instruction word.
    •   Compiler ensures that an operation is not issued before its operands are ready.
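A toy sketch of this static scheduling (illustrative Python, ignoring functional-unit
limits): operations are greedily packed into long instruction words, and an operation
starts a new word whenever one of its operands is produced within the current word.

    ops = [
        ("e", {"a", "b"}),    # e = a + b
        ("f", {"c", "d"}),    # f = c + d
        ("m", {"e", "f"}),    # m = e * f, depends on e and f
    ]

    words, current, produced = [], [], set()
    for dest, srcs in ops:
        if srcs & produced:          # an operand is made in this word: close it
            words.append(current)
            current, produced = [], set()
        current.append(dest)
        produced.add(dest)
    words.append(current)

    print(words)                     # [['e', 'f'], ['m']] -> two long instruction words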
VLIW Instruction
    •   One VLIW instruction word encodes multiple operations which allows them to be initiated in a single clock
        cycle.
    •   The operands and the operation to be performed by the various functional units are specified in the
        instruction itself.
     •   One instruction encodes at least one operation for each execution unit of the device.
     •   So the length of the instruction increases with the number of execution units.
     •   To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and on some
         architectures are much wider, up to 1024 bits.
ILP in VLIW
Consider the computation of y = a1x1 + a2x2 + a3x3
On a sequential processor:

cycle 1:  load a1
cycle 2:  load x1
cycle 3:  load a2
cycle 4:  load x2
cycle 5:  multiply z1 a1 x1
cycle 6:  multiply z2 a2 x2
cycle 7:  add y z1 z2
cycle 8:  load a3
cycle 9:  load x3
cycle 10: multiply z3 a3 x3
cycle 11: add y y z3

requires 11 cycles

On the VLIW processor with 2 load/store units, 1 multiply unit and 1 add unit:

cycle 1: load a1 | load x1
cycle 2: load a2 | load x2 | multiply z1 a1 x1
cycle 3: load a3 | load x3 | multiply z2 a2 x2
cycle 4: multiply z3 a3 x3 | add y z1 z2
cycle 5: add y y z3

requires 5 cycles
Block Diagram
Working
   •   Long instruction words are fetched from the memory.
   •   A common multi-ported register file is used for fetching the operands and storing the results.
   •   Parallel random access to the register file is possible through the read/write crossbar.
   •   Execution in the functional units is carried out concurrently with the load/store operation of data between
       RAM and the register file.
   •   There may be one or multiple register files for FX (fixed-point) and FP (floating-point) data.
   •   The design relies on the compiler to find parallelism and schedule dependency-free program code.