Generic Questions
Q1. Why do you need more than one VM? What if you are given one powerful VM instead of two or more?
Q2. Your CEO asks you: how many VMs should we buy from AWS for an app?
Q3. How do you gain confidence in serverless computing?
Q4. How do you compare the performance and cost benefits of your on-premise infrastructure against AWS?
Q5. Do you think buying more VMs improves application performance? If not, how do you achieve it?
Q6. What factors affect application performance?
Q7. How would you gain confidence in future computational advancements in the cloud offered by service providers?
Q8. What are the fundamental computing limitations and bounds?
Q9. How would you approach designing load balancers and scheduling algorithms as an R&D person?
Q10. How would you optimize the cost and performance of your on-premise or cloud infrastructure?
An Introduction to Parallel Computing
Agenda
01 Motivating Factor: Human Brain
02 The Need and Feasibility of Parallel Computing
03 Moore's Law
04 Elements of Parallel Computing
05 Factors Affecting Parallel System Performance
06 Parallel Programming Models
07 Computational Power Improvement
08 Two Eras of Computing
10 Hardware Architectures for Parallel Processing
11 Dependency Analysis & Conditions of Parallelism
12 Levels of Software Parallelism in Program Execution
13 Software Parallelism Types
[Figure: from "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.]
[Figure: processor trend data vs. year of introduction. Original data up to 2010 collected and plotted by M. Horowitz et al.; data for the 2010-2017 extension collected by K. Rupp.]
Why High-Performance/Parallel Computing?
1. Higher speed (solve problems faster): important when there are "hard" or "soft" deadlines; e.g., a 24-hour weather forecast.
2. Higher throughput (solve more problems): important when we have many similar tasks to perform; e.g., transaction processing.
3. Higher computational power (solve larger problems): e.g., a weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy.
        Categories of supercomputers
          Uniprocessor; vector machine
          Multiprocessor; centralized or distributed shared memory
          Multicomputer; communicating via message passing
          Massively parallel processor (MPP; 1K or more processors)
Need for Parallel Computing
The human brain consists of a large number (more than a billion) of neural cells that process information. Each cell works like a simple processor, and only the massive interaction among all cells and their parallel processing makes the brain's abilities possible.
The response speed of an individual neuron is slow; the aggregate speed with which complex calculations are carried out by billions of neurons demonstrates the feasibility of parallel processing.
Computing elements
Processing nodes: each processing node contains one or more processing elements (PEs) or processors, a memory system, plus a communication assist (network interface and communication controller).
Core-to-core communication options within a node:
• Cores communicate using a shared cache (lowest communication latency).
• Cores communicate using on-chip interconnects (shared system interface).
• Cores communicate over an external Front Side Bus (FSB) (highest communication latency).
Overview of an example chip: eight processor cores sharing 24 MB of level 3 (L3) cache; each core is 2-way SMT (2 threads per core), for a total of 16 threads.
Elements of Parallel Computing
• Computing Problems:
     • Numerical computing: science and engineering problems demand intensive integer and floating-point computations.
     • Logical reasoning: artificial intelligence (AI) demands logic inference, symbolic manipulation, and large search spaces.
• Operating Systems:
     • Manage the allocation of resources to running processes.
     • Map algorithmic structures onto the hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
     • Parallelism can be exploited at: 1) algorithm design, 2) program writing, 3) compilation, and 4) run time.
Elements of Parallel Computing
• Compiler Support
     • Implicit parallelism approach
          • Parallelizing compiler: automatically detects parallelism in sequential source code and transforms it into parallel constructs/code (see the loop sketch after this list).
          • Source code is written in conventional sequential languages.
• Hardware/architecture related:
     • Total CPU computational power available.
     • Types of computation modes supported (sequential, functional, and concurrent models).
     • Shared address space vs. message passing.
     • Communication network characteristics (topology, bandwidth, latency).
     • Memory hierarchy properties.
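As a concrete illustration of the implicit-parallelism approach, the loop below has no cross-iteration dependences, so an auto-parallelizing compiler can legally turn it into parallel or vector code without programmer annotations. This is a minimal sketch, not taken from the slides; the function name and signature are illustrative only.

/* A dependence-free loop of the kind a parallelizing compiler can
 * detect and transform automatically (illustrative sketch).          */
#include <stddef.h>

void scale_add(double *restrict c, const double *restrict a,
               const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = 2.0 * a[i] + b[i];   /* each iteration is independent */
}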
Computational Power Improvement
[Figure: C.P.I. vs. number of processors (1, 2, ...), contrasting uniprocessor and multiprocessor growth.]
Two eras of computing
[Figure: timeline from 1940 to 2030 showing the two eras of computing (sequential and parallel), each developing through architectures, compilers, and applications.]
Hardware architectures for parallel processing
SISD (Single Instruction, Single Data): a single instruction stream operates on a single data stream. Dominant representative SISD systems are the IBM PC, Macintosh, and workstations.
[Figure: SISD organization: a single instruction stream feeding one processor.]
Hardware architectures for parallel processing
SIMD (Single Instruction, Multiple Data) is a multiprocessor machine capable of executing the same instruction on all the CPUs while operating on different data streams.
SIMD models are well suited to scientific computing, since such workloads involve many vector and matrix operations.
Ex.: Ci = Ai * Bi
A dominant representative SIMD system is Cray's vector processing machine.
[Figure: SIMD organization: a single instruction stream broadcast to Processor 1, Processor 2, ..., Processor N, each working on its own data stream.]
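The slide's example Ci = Ai * Bi maps directly onto this model: one instruction, many data elements. A minimal C sketch follows; the OpenMP simd pragma used to request vectorization is an illustrative choice, not something prescribed by the slides.

/* Element-wise multiply C[i] = A[i] * B[i]: the same operation is
 * applied across all elements, so a vector/SIMD unit can process
 * several of them per instruction.                                   */
#include <stddef.h>

void vec_mul(double *restrict c, const double *restrict a,
             const double *restrict b, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}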
Hardware architectures for parallel processing
MISD (Multiple Instruction, Single Data): multiple instruction streams operate on a single data stream.
Ex.: Y = sin(x) + cos(x) + tan(x), where several different operations are applied to the same operand x.
Machines built using the MISD model are not useful in most applications; a few machines have been built, but none of them are available commercially.
[Figure: MISD organization: Instruction Stream 1, Instruction Stream 2, ..., Instruction Stream N driving Processor 1, Processor 2, ..., Processor N over a single data stream.]
Hardware architectures for parallel processing
MIMD (Multiple Instruction, Multiple Data):
[Figure: MIMD organization: Processor 1, Processor 2, ..., Processor N, each executing its own instruction stream on its own data stream.]
MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory MIMD.
Hardware architectures for parallel processing
Shared-memory MIMD:
• All the PEs are connected to a single global memory and all of them have access to it; such systems are referred to as tightly coupled multiprocessor systems.
• Communication among PEs takes place through the shared memory.
• A modification of data stored in the global memory by one PE is visible to all other PEs.
Ex.: Silicon Graphics machines and Sun/IBM SMPs. Programming API: OpenMP.
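A minimal OpenMP sketch of the shared-memory model (illustrative only, not from the slides): all threads see the same array, so no explicit messages are needed.

/* Shared-memory MIMD sketch: threads cooperate through one shared
 * array; compile with an OpenMP-enabled compiler (e.g. gcc -fopenmp). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 1000 };
    static double data[N];          /* lives in the single global memory */
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 0.5 * i;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)     /* each thread (PE) sums a slice     */
        sum += data[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}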
Distributed-memory MIMD:
• All PEs have their own local memory; such systems are referred to as loosely coupled multiprocessor systems, and communication between PEs takes place through the interconnection network.
• PEs can be connected in tree, mesh, cube, and other topologies.
• Each PE operates asynchronously; if communication/synchronization among tasks is necessary, they do so by exchanging messages. Programming API: MPI.
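A matching MPI sketch of the distributed-memory model (illustrative only): each process owns its local data and cooperates purely by message exchange.

/* Distributed-memory MIMD sketch: every PE keeps a local value and
 * MPI_Reduce combines them on rank 0 via messages.                    */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;            /* data in this PE's local memory  */
    int total = 0;

    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}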
Hardware architectures for parallel processing
• The shared-memory MIMD architecture is easier to program but is less tolerant of failures and harder to extend than the distributed-memory MIMD model.
• Failures in a shared-memory MIMD system affect the entire system, whereas this is not the case for the distributed model.
• Shared-memory MIMD architectures are less likely to scale because the addition of more PEs leads to memory contention.
• This situation does not arise with distributed memory, in which each PE has its own memory.
• As a result, distributed-memory MIMD architectures are the most popular today.
Message passing:
     • Explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks, using messages.
Data parallel:
     • More regimented: global actions on data (i.e., the same operation over all elements of an array or vector).
     • Can be implemented with a shared address space or with message passing.
Case Study: An example
[Figure: task dependency graph for tasks A-G, with two schedules: (left) all tasks run on one processor, giving T1 = 21; (right) the tasks scheduled across two processors P0 and P1 with idle and communication slots, giving T2 = 16.]
Assumptions: the computation time for each task A-G is 3; the communication time between parallel tasks is 1; communication can overlap with computation.
Speedup on two processors: T1/T2 = 21/16 ≈ 1.3.
What would the speedup be with 3 processors? 4 processors? 5 ...?
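A quick check of the quoted numbers (the 16-step two-processor schedule length itself is read off the figure):
\[
T_1 = 7 \times 3 = 21, \qquad T_2 = 16, \qquad
S = \frac{T_1}{T_2} = \frac{21}{16} \approx 1.3 .
\]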
Session 2: Parallel Computing
Parallel Programs: Definitions
A parallel program consists of a number of tasks running as threads (or processes) on a number of processing elements that cooperate/communicate as part of a single parallel computation.
Task:
     • An arbitrary piece of undecomposed work in the parallel computation.
     • Executed sequentially on a single processor; concurrency in the parallel computation arises only across tasks.
Parallel or independent tasks:
     • Tasks with no dependencies among them, which can therefore run in parallel on different processing elements.
Parallel task grain size: the amount of computation in a task.
Process (thread):
     • Abstract program entity that performs the computations assigned to a task.
     • Processes communicate and synchronize to perform their tasks.
Processor (or processing element):
     • Physical computing engine on which a process executes sequentially.
     • Processes virtualize the machine to the programmer:
          • First write the program in terms of processes, then map the processes to processors (see the sketch after this list).
Communication-to-computation ratio (C-to-C ratio): represents the amount of resulting communication between the tasks of a parallel program.
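A minimal sketch of this vocabulary in code (illustrative only; the task contents and thread count are made up): tasks are expressed as threads, and the operating system maps those threads onto the available processing elements.

/* Tasks written as threads; the OS maps threads to processors.          */
#include <stdio.h>
#include <pthread.h>

static void *task(void *arg)          /* a task: undecomposed unit of work */
{
    long id = (long)arg;
    printf("task %ld executing sequentially on some PE\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, task, (void *)i);  /* tasks -> threads */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);      /* synchronize before finishing      */
    return 0;
}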
Levels of parallelism
Levels of parallelism are decided based on the lumps of code (grain size) that are potential candidates for parallelism.
• The idea is to execute concurrently two or more single-threaded applications, such as compiling, text formatting, database searching, and device simulation.
• Major levels of granularity:
     • Coarse grain (or task level)
     • Medium grain (or control level)
     • Fine grain (data level)
Levels of parallelism
According to task (grain) size:
Level 5 - Jobs or programs (multiprogramming, i.e., multi-tasking): coarse grain
Level 4 - Subprograms, job steps, or related parts of a program: medium grain
Level 3 - Procedures, subroutines, or co-routines: medium grain
Level 2 - Non-recursive loops or unfolded iterations: fine grain (thread-level parallelism, TLP)
Level 1 - Instructions or statements: fine grain (instruction-level parallelism, ILP)
Moving from coarse grain (level 5) down to fine grain (level 1), communication demand, mapping/scheduling overheads, and the C-to-C ratio increase, while the degree of software parallelism (DOP) grows as the tasks become smaller and more numerous.
Computational Parallelism and Grain Size
Task grain size (granularity) is a measure of the amount of computation involved in a task in a parallel computation.
[Figure: a program decomposed into tasks; statements S1 and S2 shown in program order.]
Assume task S2 follows task S1 in sequential program order.
1. True data (flow) dependence: task S2 is data dependent on task S1 if an execution path exists from S1 to S2 and if at least one output variable of S1 feeds in as an input operand used by S2. Represented by S1 → S2 in task dependency graphs.
2. Anti-dependence (a name dependence): task S2 is anti-dependent on S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1. Represented by S1 → S2 in dependency graphs.
3. Output dependence (a name dependence): two tasks S1 and S2 are output dependent if they produce the same output variables (or their outputs overlap). Represented by S1 → S2 in task dependency graphs.
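A tiny C illustration of the three dependence types (not from the slides; the variables are arbitrary):

void dependence_demo(void)
{
    int a = 1, b = 2, c, d;

    c = a + b;    /* S1: writes c                                        */
    d = c * 2;    /* S2: true (flow) dependence S1 -> S2 (S2 reads c)    */
    c = b + 5;    /* S3: anti-dependence S2 -> S3 (S3 overwrites the c   */
                  /*     that S2 read); output dependence S1 -> S3       */
                  /*     (S1 and S3 write the same variable c)           */
    (void)c;
    (void)d;      /* silence unused-variable warnings                    */
}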
Dependencies
Dependency Graph Example
Here assume each instruction is treated as a task:
[Figure: dependency graph; S3 is R1 ← R3, and (S2, S3) is an anti-dependence, i.e., S2 → S3.]
Dependency Graph Example
Here assume each instruction is treated as a task:

MIPS code (in program order):
1  ADD.D  F2, F1, F0
2  ADD.D  F4, F2, F3
3  ADD.D  F2, F2, F4
4  ADD.D  F4, F2, F6

Output dependence: (1, 3) and (2, 4), i.e., 1 → 3 and 2 → 4.
Anti-dependence: (2, 3) and (3, 4), i.e., 2 → 3 and 3 → 4.
[Figure: the corresponding task dependency graph.]
Conditions of Parallelism
• Control dependence: algorithm and program related.
[Figure: example program graph with arithmetic operations mapped onto processors P2-P5.]
• Ideal (linear) speedup is very hard to achieve: it implies no parallelization overheads and a perfect load balance among all processors.
• Maximize parallel speedup by:
     • Balancing the computations on the processors (every processor does the same amount of work) with the same amount of overhead.
     • Minimizing communication cost and the other overheads associated with each step of parallel program creation and execution.
• Performance scalability:
     • Achieve a good speedup for the parallel application on the parallel architecture as the problem size and machine size (number of processors) are increased.
Parallel Performance Metrics
Efficiency, Utilization, Redundancy, Quality of Parallelism
System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be
the parallel execution time in unit time steps:
      • In general T(n) << O(n) (more than one operation is performed by more than one processor in unit time).
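A sketch of one common formulation of these metrics (Hwang's), assuming the speedup is defined as S(n) = T(1)/T(n); treat the exact forms as assumptions rather than the slide's own:
\[
E(n) = \frac{S(n)}{n} = \frac{T(1)}{n\,T(n)}, \qquad
R(n) = \frac{O(n)}{O(1)}, \qquad
U(n) = R(n)\,E(n), \qquad
Q(n) = \frac{S(n)\,E(n)}{R(n)} .
\]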
[Figure: execution time split into a serial section f·ts and a parallelizable section tp = (1 - f)·ts / p spread over p processors.]
Another Way: Speedup
\[
S(p) \;=\; \frac{t_s}{f\,t_s + \dfrac{(1-f)\,t_s}{p}}
      \;=\; \frac{1}{f + \dfrac{1-f}{p}},
\qquad
\lim_{p \to \infty} S(p) \;=\; \frac{1}{f}.
\]
[Figure: speedup S(p) vs. number of processors for serial fractions f = 0.01, f = 0.1, and f = 0.85.]
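A direct transcription of this formula into code (illustrative sketch; the function name and the sample value of f are arbitrary):

/* Amdahl speedup S(p) = 1 / (f + (1 - f)/p) for serial fraction f.    */
#include <stdio.h>

static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    for (int p = 1; p <= 1024; p *= 4)
        printf("f = 0.10, p = %4d  ->  S = %6.2f\n", p, amdahl_speedup(0.10, p));
    printf("limit as p -> infinity: 1/f = %.1f\n", 1.0 / 0.10);
    return 0;
}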
Super-linear Speedup (S(n) > n)
(a) Searching each sub-space sequentially
[Figure: the sequential search examines the sub-spaces one after another, each taking ts/p; the solution is found x·ts/p plus a further Δt into the search (x indeterminate).]
Super-linear Speedup (S(n) > n)
(b) Searching each sub-space in parallel
[Figure: all p sub-spaces are searched simultaneously; the solution is found after time Δt.]
\[
S(p) \;=\; \frac{x \cdot \dfrac{t_s}{p} + \Delta t}{\Delta t}.
\]
Super-linear Speedup (S(n) > n)
Worst case for the sequential search: the solution is found in the last sub-space, so the parallel version offers the greatest benefit:
\[
S(p) \;=\; \frac{\dfrac{p-1}{p}\,t_s + \Delta t}{\Delta t}
\;\longrightarrow\; \infty \quad \text{as } \Delta t \text{ tends to zero.}
\]
Least advantage for the parallel version: the solution is found in the first sub-space of the sequential search:
\[
S(p) \;=\; \frac{\Delta t}{\Delta t} \;=\; 1.
\]
The actual speedup depends on which sub-space holds the solution, but it could be extremely large.
Examples
• 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
\[
\psi \;\le\; \frac{1}{0.05 + (1 - 0.05)/8} \;\approx\; 5.9
\]
• An oceanographer gives you a serial program and asks how much faster it might run on 8 processors. You can find only one function amenable to a parallel solution. Benchmarking on a single processor reveals that 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?
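Applying the same bound to the oceanographer's program (a worked sketch; it assumes the remaining 20% stays strictly serial):
\[
\psi \;\le\; \frac{1}{0.2 + (1 - 0.2)/8} \;=\; \frac{1}{0.3} \;\approx\; 3.3 .
\]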
• As the machine size increases, the workload (or problem size) also increases.
• Let Ts be the time to perform the sequential operations and Tp(n, W) the time to perform the parallel operations for a problem size or workload W using n processors.
• Then the speedup with n processors is the ratio of the time on one processor to the time on n processors (see the sketch below).
• Assume the parallel operations achieve a linear speedup (i.e., these operations take 1/n of the time they would take on one processor).
• Then Tp(1, W) = n · Tp(n, W).
• Let α be the fraction of the workload that is sequential.
• The speedup can then be expressed as shown below.
Speedup (Gustafson’s Law)
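A sketch of the scaled-speedup expression the bullets above lead to, assuming α = Ts / (Ts + Tp(n, W)):
\[
S(n) \;=\; \frac{T_s + T_p(1, W)}{T_s + T_p(n, W)}
      \;=\; \frac{T_s + n\,T_p(n, W)}{T_s + T_p(n, W)}
      \;=\; \alpha + n\,(1 - \alpha)
      \;=\; n - \alpha\,(n - 1).
\]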
Cost Optimal