
Parallel and Distributed Algorithms

A course for 3rd-year students (Major in Computer Science)

Parallel Programming: Models and Performance Issues
Contents
 The “ideal parallel computer”: PRAM
 Categories of PRAMs
 PRAM algorithm examples
 Algorithmic Models
 Data-Parallel Model
 Task Graph Model
 Work Pool Model
 Master-Slave Model
 Pipeline (Producer-Consumer) Model
 Analytical Modeling of Parallel Programs
 Performance Issues
 Parallel Performance: Metrics
 Parallel Performance: Models
 Using DAGs and Task Graphs
 A 5-Step Guide to Parallelization
 Amdahl’s and Gustafson's Laws
Implicit and Explicit Parallelism
 A common misconception is that parallel programs are difficult
to write as compared to sequential programs
 This is not at all true
 What can be hard is to derive parallel code automatically from
a sequential description of the program
 This is called implicit parallelism
 Means the programmer does not specify parallelism
(nor control it); instead, the compiler and/or run-time
support system automatically extracts and exploits it
 The other case, more important to us, is explicit parallelism
 Means that parallelism is specified in the source code
by the programmer (who is in control), using special
language constructs, compiler directives, or library calls
Implicit Parallelism
Q: “Can you say if ANY of the statements included in the
next sequences may be executed in parallel?”
• a = 2; b = 3;
A: Yes (no dependencies between assignments)
• a = 3; b = a;
A: No (parallel execution results in non-determinism)
• for (i=0; i<100; i++) b[i] = i;
A: Yes (no dependencies between independent iterations)
• a = f(x); b = g(x);
A: Cannot say without seeing the code of the
functions f and g (i.e., dependencies may exist between
the two statements if f and g update the same variables)
Explicit Parallel Programming: layers
• Platforms & physical organization - hardware layer
• Communication network
• Logical organization - the programmer's view of the platform
• Process-to-processor mappings - execution layer
The Ideal Parallel Computer
Is it reasonable to look for a single ideal model?
(similar to von Neumann model for sequential computing)
PRAM - Parallel Random Access Machine
consists of:
• p processors, working in lock-step, synchronous
manner on the same program instructions
• each connected to an unbounded shared memory
(access time to shared memory costs one step)
• each processor may have its local memory
Why do we need a PRAM model?
PRAM abstracts away communication and allows us to
focus on the parallel tasks themselves
The PRAM MODEL of Execution
n RAM processors connected to a common memory of m cells
[Figure: processors P1, P2, …, Pi, …, Pn, each connected to a common memory of cells 1, 2, …, m]
ASSUMPTION: at each time unit each processor Pi can either read a cell of
memory, make an internal computation or write in [another] memory cell
CONSEQUENCE: a pair of processors Pi, Pj can communicate in constant time!
Pi writes the message in cell x at time t
Pj reads the message from cell x at time t+1
Why PRAM is an Ideal Parallel Computer?
 PRAM is a natural extension of the sequential model of
computation (RAM):
 a processor, its local memory and the local projection of
the program forms a RAM
 the PRAM simply provides a means of interaction among
processors at no cost
 An algorithm for PRAM might lead to a good algorithm for
a real machine
 BUT: If something cannot be efficiently solved on PRAM, it
cannot be efficiently done on any practical machine (based
on our current technology)
 And: It’s not feasible to manufacture PRAMs! The real cost of
connecting p processors to m memory cells so that accesses do not
interfere is on the order of Θ(pm), huge for any practical value of m
What is the Mission: Impossible …
Computing in constant time
• Archimedes: Give me a lever
long enough and a place to stand
and I will move the earth

• NOWADAYS….
Give me a parallel machine with
enough processors and I will
find the smallest number in any
giant set in constant time!
PRAM Complexity Measures
• For each individual processor:
– time: number of instructions executed
– space: number of memory cells accessed
• For the PRAM machine
– parallelism: maximum number of active processors
– time: time taken by the longest running processor

! There may be other technical issues for PRAM:
– How processors are activated
– How shared memory is accessed
Categories of PRAMs (based on
type of processor activation)
• P0 places the number of processors (p) in a
designated shared-memory cell
– each active Pi (where i < p), starts executing
after comparing its index with p
– needs O(1) time to activate ...
– all processors halt when P0 halts
• Active processors explicitly activate additional
processors via FORK instructions
– tree-like activation
– O(log p) time to activate

The i-th processor activates processors 2i and 2i+1
Problems related to PRAM
Processor Activation

• Too many interconnections → synchronization problems


• However, it is the best conceptual model for designing
efficient parallel algorithms
– due to simplicity and possibility of simulating efficiently
PRAM algorithms on more realistic parallel architectures

Basic parallel statement

    for all x in X do in parallel
        instruction(x)

For each x, the PRAM will assign a processor which will
execute instruction(x)
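On a real shared-memory machine this construct is usually approximated with a parallel loop. A minimal sketch in C with OpenMP (one possible implementation choice, not prescribed by these notes):

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
        int b[N];
        /* "for all i in 0..N-1 do in parallel: b[i] = i" */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = i;               /* instruction(i): independent iterations */
        printf("b[42] = %d\n", b[42]);
        return 0;
    }

Compile with -fopenmp; without that flag the pragma is ignored and the loop simply runs sequentially.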
Categories of PRAMs (based on type
of shared memory access)
 Restrictions may be imposed for simultaneous read/write
operations in the common memory
 There are four main classes (the difference is on how the
simultaneous accesses are handled):
• Exclusive Read, Exclusive Write - EREW PRAM (the only
constructive, “real” model)
• Concurrent Read, Exclusive Write - CREW PRAM (realistic)
• Exclusive Read, Concurrent Write - ERCW PRAM
(introduced here for the sake of completeness)
• Concurrent Read, Concurrent Write - CRCW PRAM (the
only model allowing any type of access)
CRCW: Resolving Concurrent Writes
• Allowing concurrent read access does not create semantic
discrepancies in the program
• Concurrent write access to a memory location requires
arbitration; ways of resolving concurrent writes:
• Common – all writes must write the same value
• Arbitrary – arbitrary write succeeds
• Priority – the write with highest priority succeeds
• Sum – the sum of the written values is stored
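On real hardware the Sum rule can be emulated by combining the concurrent writes with an atomic addition. A minimal sketch in C with OpenMP (an illustration only, not part of the PRAM model itself):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int cell = 0;                  /* the shared memory location  */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++) {
            #pragma omp atomic         /* concurrent writes combined  */
            cell += 1;                 /* each "writer" contributes 1 */
        }
        printf("cell = %d\n", cell);   /* prints 8: the sum of writes */
        return 0;
    }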
Computational Power:

• Let’s try some exercises on PRAMs!


PRAM Algorithm Example 1
Problem (parallel prefix): use an EREW PRAM to sum the numbers stored at
m0, m1, …, m(n-1), where n = 2^k for some k. The result should be stored at m0.

Algorithm for processor pi:

    for (j=0; j<k; j++)
        if (i % 2^(j+1) == 0) {
            a = read(m_i);
            b = read(m_(i+2^j));
            write(a+b, m_i);
        }

Example for k=3 (active processors shown before each round):

     1  8  3  2  7  3  1  4     (initial contents of m0 … m7)
    p0     p2     p4     p6     (round j=0)
     9  8  5  2 10  3  5  4
    p0           p4             (round j=1)
    14  8  5  2 15  3  5  4
    p0                          (round j=2)
    29  8  5  2 15  3  5  4
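A sequential C simulation of the same rounds (each pass of the inner loop models one lock-step PRAM time step; the data are those of the example above):

    #include <stdio.h>

    #define K 3
    #define N (1 << K)                       /* n = 2^k */

    int main(void) {
        int m[N] = {1, 8, 3, 2, 7, 3, 1, 4};
        /* Round j: every processor p_i with i % 2^(j+1) == 0
           adds m[i + 2^j] into m[i]. */
        for (int j = 0; j < K; j++)
            for (int i = 0; i < N; i += 1 << (j + 1))
                m[i] += m[i + (1 << j)];
        printf("sum = %d\n", m[0]);          /* prints 29 */
        return 0;
    }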
PRAM Example Notes

• the program is written in SIMD (and SPMD) format

• the inefficiency caused by idling processors is clearly visible


• can be easily extended to n that is not a power of 2
• takes log2(n) rounds to execute

Important!
 using a similar approach, based on the idea of parallel prefix, it can
be shown that:

Any CRCW PRAM can be simulated by an EREW PRAM


with a slowdown factor of O(log^2 n)
PRAM Algorithm Example 2
• Initially
– table A contains values 0 and 1
– output contains value 0

• The program executed on a priority CRCW PRAM computes


the “Boolean OR” of A[1], A[2], A[3], A[4], A[5]
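The program itself appeared only as a figure; a minimal sketch of the standard idea, in the same read/write notation as Example 1 (assuming processor pi is assigned to element A[i]), would be:

    if (read(A[i]) == 1)
        write(1, output);

All processors that write do so concurrently and write the same value, so the result is well defined on any CRCW variant (including the priority one), and the OR is computed in O(1) time.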
PRAM Algorithm Example 3
Problem: use Sum - CRCW PRAM with n2 processors to sort
n numbers stored at x0, x1, …, xn-1.
CRCW condition: processors can write concurrently 0s and
1s in a location, the sum of values will actually be written
Question: How many steps would it take?
1. O(n log n)
2. O(n)
3. O(log n)
4. O(1)
5. less than (n log n) / n^2
PRAM Example 3 (cont.)
Note: We will mark processors pi,j for 0 <= i,j < n
Algorithm for processor pi,j:

    a = read(x_i);
    b = read(x_j);
    if ((a > b) || ((a == b) && (i > j)))
        write(1, m_i);
    if (j == 0) {
        b = read(m_i);
        write(a, x_b);
    }

Example for x[] = 1 7 3 9 3 0. The values written into m0 … m5
(one row for each value of j; the column sums give m[]):

    m0 m1 m2 m3 m4 m5
     0  1  1  1  1  0
     0  0  0  1  0  0
     0  1  0  1  0  0
     0  0  0  0  0  0
     0  1  0  1  1  0
     1  1  1  1  1  0

    m[]        = 1 4 2 5 3 0   (the summed ranks)
    sorted x[] = 0 1 3 3 7 9

O(1) sorting algorithm!
(Chaudhuri, p. 90-91)
Find the small error in the matrix!
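A sequential C simulation of the two phases (the nested loops model the n^2 processors of one lock-step PRAM time step; the data are those of the example above):

    #include <stdio.h>

    #define N 6

    int main(void) {
        int x[N] = {1, 7, 3, 9, 3, 0};
        int m[N] = {0};                      /* rank cells, initially 0 */
        int out[N];

        /* Phase 1: every p_{i,j} compares x_i with x_j; on the Sum-CRCW
           PRAM the concurrent write(1, m_i) operations are combined by
           addition, which we simulate by accumulating into m[i].       */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (x[i] > x[j] || (x[i] == x[j] && i > j))
                    m[i] += 1;

        /* Phase 2: processor p_{i,0} writes x_i to position m_i. */
        for (int i = 0; i < N; i++)
            out[m[i]] = x[i];

        for (int i = 0; i < N; i++)
            printf("%d ", out[i]);           /* prints: 0 1 3 3 7 9 */
        printf("\n");
        return 0;
    }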


The beauty and challenge of parallel algorithms
Problems that are trivial in a sequential setting can be quite
interesting and challenging to parallelize.
Homework: Compute sum of n numbers
How would you do it in parallel?
• using n processors

• using p processors
• when communication is cheap
• when communication is expensive
Algorithmic Models
 aim to offer a common basis for developing, expressing,
and comparing parallel algorithms
 generally, they use the architectural model of a
shared-memory parallel machine (multi-processor)
 shared memory is a useful abstraction from the
programmer's point of view, especially for the early
phases of algorithm design
 communication is kept as simple as possible
 usual causes of inefficiency are eliminated
Parallel Algorithmic Models
• Data-Parallel Model

• Task Graph Model


• Work Pool Model
• Master-Slave Model
• Pipeline (Producer-Consumer) Model
Data Parallel Model
 Working principle
• divide data up amongst processors
• process different data segments in parallel
• communicate boundary information, if necessary
 Features
• includes loop parallelism
• well suited for SIMD machines
• communication is often implicit
Task Graph Model
 decompose algorithm into different sections
 assign sections to different processors
 often uses fork()/join()/spawn()
 usually does not lend itself to a high degree of parallelism
Work Pool Model
• dynamic mapping of tasks to processes
• typically small amount of data per task
• the pool of tasks (priority queue, hash table, tree) can be
centralized or distributed

[Figure: processes P0 … P3 take tasks (t0, t2, t3, t7, t8, …) from a shared
work pool, process them, and possibly add new tasks back to the pool]
Processor Farm (Master-Slave) Model
 master generates and allocates tasks
 can also be hierarchical/multi-layer
 master potentially a bottleneck
 overlapping communication and computation at the
master often useful
Pipelining
a sequence of tasks whose execution can overlap
 sequential processor must execute them sequentially,
without overlap
 parallel computer can overlap the tasks, increasing
throughput (but not decreasing latency)
Predicting and Measuring
Parallel Performance
• Building parallel versions of software can enable it to process:
– a given data set in significantly less time
– multiple data sets in a fixed amount of time
– large-scale data sets that are prohibitive with sequential
software
• These are the visible cases of performance increase; but what
about parallel performance in the other cases?
– Traditional measures like MIPS and MFLOPS do not really
capture parallel computation performance
• Example: clusters, which can have a very high FLOPS
rating, may be much slower at accessing data spread
across the cluster
Metrics for Parallel Systems
and Algorithms Performance
• We need to define some systematic measures on parallel
performance:
– Execution Time
– Speedup
– Efficiency
– System Throughput
– Cost-effectiveness
– Granularity
– Scalability
– Degree of Concurrency (Parallelism, Utilization)
– Data access speed etc.
Analytical Modeling - Basics
• A sequential algorithm is evaluated by its runtime
(in general, asymptotic runtime as a function of
input size).
• The asymptotic runtime of a sequential program is
identical on any serial platform.
• The parallel runtime of a program depends on many
factors: input size, the number of processors, the
communication parameters of the machine, among others.
• Therefore:
– A parallel algorithm must be analyzed in the
context of the underlying platform.
– A parallel system is a combination of a parallel
algorithm and an underlying platform.
Performance Issues
A number of performance measures are intuitive:
• “Wall clock” time – the time from the start of the 1st
processor to the stopping time of the last processor
– How does this scale with the no. of processors if
the program is ported to another machine?
• Speedup, or how much faster is the parallel version?
– This begs the obvious follow-up question:
What’s the baseline serial version we compare
with? Can we “use” a suboptimal serial program
to start with? This can make our parallel program
look faster, maybe…
Other Intuitive Performance Metrics
Granularity
• fine grained: large number of small tasks
• coarse grained: small number of large tasks
Degree of Concurrency
• the maximal number of tasks that can be executed simultaneously
Critical Path
• the costliest directed path between any pair of start and finish
nodes in the task dependency graph
• the cost of the path is the sum of the weights of the nodes
Task Interaction Graph
• tasks correspond to nodes and an edge connects two tasks if they
communicate/interact with each other
Performance Metrics for Parallel
Systems: Execution Time
• Serial runtime of a program is the time elapsed
between the beginning and the end of its
execution on a sequential computer.

• The parallel runtime is the time that elapses


from the moment the first processor starts to
the moment the last processor finishes
execution.

• We may denote the serial runtime by TS and


the parallel runtime by TP .
Overhead in Parallel Programs
• If I use two processors, shouldn’t my program
run twice as fast?

• No! A lot of “overhead” tasks, including wasted


computation, communication, idling, and
contention cause degradation in performance.
Execution time and overhead
• The response time – measures interval between
submission of a request until the first response
is produced
• Execution Time
– Parallel runtime, Tp
– Sequential runtime, Ts
• Total Parallel Overhead: To = p Tp − Ts
• The overhead is any combination of excess or
indirect computation time, memory, bandwidth,
etc.
Typical Overhead

The execution profile of a hypothetical parallel program executing on


eight processing elements. Profile indicates times spent performing
computation (both essential and excess), communication, and idling.
Minimizing Overhead
• Traditional sources of overhead:
– Function call: invoking a function incurs the
overhead of branching and modifying the stack
pointer regardless of what that function does
– Recursion
• When we can choose among several algorithms,
each of which has known characteristics, their
overheads differ
• Overhead can influence the decision whether
we want or not to parallelize a piece of code!
Sources of Overhead in
Parallel Programs
• Inter-process interactions: Processors working
on any non-trivial parallel problem will need
to communicate with each other.
• Idling: Processes may idle because of load
imbalance, synchronization, or serial
components.
• Excess Computation: This is computation not
performed by the serial version.
– This is because the serial algorithm is difficult to
parallelize, or that some computations are repeated
across processors to minimize communication.
Total Parallel Overhead
• Let Tall be the total time collectively spent by all
the processing elements.
– TS is the serial time.
– Observe that Tall - TS is then the total time
spent by all processors combined on non-useful
work. This is called the total overhead.
– The total time collectively spent by all the
processing elements
Tall = p TP (p is the number of processors).
• The overhead function (To) is therefore given by
To = p TP - TS (1)
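For example (hypothetical values): if p = 4 processors solve in TP = 30 s a problem whose best serial time is TS = 100 s, then Tall = 4 · 30 = 120 s and the total overhead is To = 120 − 100 = 20 s of collectively non-useful work.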
Performance Metrics for
Parallel Systems: Speedup
• Speedup is the most commonly used measure of
parallel performance
• If Ts is the best possible serial time and Tp is
the time taken by a parallel algorithm on p
processors, then
S = Ts / Tp
• Linear speedup occurring when Tp = Ts/p, is
considered ideal
• Super-linear speedup can happen in some cases
Speedup Definition Variability (1)
• Exactly what is meant by Ts (i.e. the time
taken to run the fastest serial algorithm on
one processor)
– One processor of the parallel computer?
– The fastest serial machine available?
– A parallel algorithm run on a single processor?
– Is the serial algorithm the best one?
• To keep things fair, Ts should be the best
possible time in the serial world
Speedup Definition Variability (2)
A slightly different definition of speedup:
• The time taken by the parallel algorithm on one
processor divided by the time taken by the
parallel algorithm on N processors
• However, this is misleading since many parallel
algorithms contain additional operations (e.g. the
communication) to accommodate the parallelism
• Result: the baseline Ts is inflated, thus exaggerating the
speedup
Speedup Example
• Consider the problem of parallel bubble sort.
• The serial time for bubble sort is 150 seconds.
• The parallel time for odd-even sort (efficient
parallelization of bubble sort) is 30 seconds.
• The speedup would appear to be 150/30 = 5.
• But is this really a fair assessment of the system?
• What if serial quicksort only took 20 seconds? In
this case, the speedup is 20/30 = 0.67! This is a
more realistic assessment of the system.
Speedup Bounds
• Speedup can be as low as 0 (the parallel program
never terminates).
• Speedup, in theory, should be upper bounded by p -
after all, we can only expect a p-fold speedup if we
use p times as many resources.
• A speedup greater than p is possible only if each
processing element spends less than time TS / p
solving the problem.
• In this case, a single processor could be time-sliced
to achieve a faster serial program, which contradicts
our assumption of fastest serial program as speedup
basis
Factors That Limit Speedup
• Computational (Software) Overhead
– Even with a completely equivalent algorithm, software
overhead arises in the concurrent implementation
• Poor Load Balancing
– Speedup is generally limited by the speed of the slowest
node. So an important consideration is to ensure that
each node performs the same amount of work
• Communication Overhead
– Assuming that communication and calculation cannot
be overlapped, then any time spent communicating the
data between processors directly degrades the speedup
Linear Speedup
• Whatever definition is used, the ideal is to
produce linear speedup (N, using N cores)
• However in practice the speedup is reduced
from its ideal value of N
• For applications that scale well, the speedup
should increase at or close to the same rate of
increase in the number of processors (threads)
• Super-linear speedup results when
– unfair values are used for Ts
– there are differences in the nature of the hardware used
Superlinear Speedup
One reason for superlinearity is that the parallel version does
less work than the corresponding serial algorithm

Searching an unstructured tree for a node with a given label, `S', on


two processing elements using depth-first traversal. The two-
processor version with processor 0 searching the left subtree and
processor 1 searching the right subtree expands only the shaded
nodes before the solution is found. The corresponding serial
formulation expands the entire tree. It is clear that the serial
algorithm does more work than the parallel algorithm.
Resource-based superlinearity
The higher aggregate cache/memory bandwidth
can result in better cache-hit ratios, and therefore
super-linearity.
Example:
A processor with 64KB of cache yields an 80% hit ratio. If
two processors are used, since the problem size per processor is
smaller, the hit ratio goes up to 90%. Of the remaining 10% of
accesses, 8% go to local memory and 2% to remote memory. If
DRAM access time is 100 ns, cache access time is 2 ns, and
remote memory access time is 400 ns, this corresponds to a
speedup of 2.43!
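The 2.43 figure follows if we assume memory access time dominates the runtime: on one processor the average access costs 0.8 · 2 + 0.2 · 100 = 21.6 ns, while on each of the two processors it costs 0.9 · 2 + 0.08 · 100 + 0.02 · 400 = 17.8 ns, so the speedup is 2 · (21.6 / 17.8) ≈ 2.43.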
Speedup Curves

[Figure: speedup vs. number of processors, showing the superlinear,
linear (ideal), and typical speedup curves]
Efficiency
• Speedup does not measure how efficiently the
processors are being used
– Is it worth using 100 processors to get a speedup of 2?
• Efficiency is defined as the ratio of speedup to the
number of processors required to achieve it:
E = S / p
– The efficiency is bounded from above by 1, and measures the
fraction of time for which a processor is usefully employed
• In the ideal case, S = p, so E = 1
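A small C helper that ties together the metrics defined so far (the timing values are hypothetical, reusing the bubble-sort numbers from the speedup example; p = 10 is an assumption):

    #include <stdio.h>

    int main(void) {
        double ts = 150.0;               /* best serial time (s)          */
        double tp = 30.0;                /* parallel time on p processors */
        int    p  = 10;

        double speedup    = ts / tp;     /* S  = Ts / Tp   */
        double efficiency = speedup / p; /* E  = S / p     */
        double cost       = p * tp;      /* p Tp (work)    */
        double overhead   = cost - ts;   /* To = p Tp - Ts */

        printf("S = %.2f  E = %.2f  cost = %.1f  To = %.1f\n",
               speedup, efficiency, cost, overhead);
        return 0;
    }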
Parallel Time, Speedup, and
Efficiency Example
Consider the problem of edge-detection in images. The
problem requires us to apply a 3 x 3 template to each
pixel. If each convolution (multiply-add) takes time tc,
the serial time for an n x n image is given by TS = 9 tc n^2.

Example : (a) an 8 x 8 image; (b) typical templates for detecting


edges; and (c) partitioning of the image across four processors
with shaded regions indicating image data that must be
communicated from neighboring processors to processor 1.
Parallel Time, Speedup, and
Efficiency Example (cont.)
• One possible parallelization partitions the image
equally into vertical segments, each with n2 / p pixels.
• The boundary of each segment is formed by 2 layers
of n pixels. This is also the number of pixel values
that will have to be communicated. Assuming each
pixel is stored in a word, comm. takes time 2(ts + twn)
• Templates can be applied to all n^2 / p pixels in time 9 tc n^2 / p.
Parallel Time, Speedup, and
Efficiency Example (continued)
• The total time for the algorithm is therefore given by:

TP = 9 tc n^2 / p + 2 (ts + tw n)

• The corresponding values of speedup and efficiency
are given by:

S = 9 tc n^2 / (9 tc n^2 / p + 2 (ts + tw n))

and

E = 1 / (1 + 2 p (ts + tw n) / (9 tc n^2))
Cost of a Parallel System
• Cost is the product of parallel runtime and the number
of processing elements used (p x TP )
• Cost reflects the sum of the time that each processing
element spends solving the problem.
• A parallel system is said to be cost-optimal if the cost of
solving a problem on a parallel computer is
asymptotically identical to serial cost
• Since E = TS / (p TP), for cost-optimal systems E = Θ(1).
• Cost is sometimes referred to as work or processor-time
product
Cost Example
Consider the problem of adding numbers on p
processors.
• For p = n , we have TP = log n
• The cost of this system is given by pTP = n log n.
• Since the serial runtime of this operation is Θ(n), the
algorithm is not cost optimal.
Impact of Non-Cost Optimality
Consider a sorting algorithm that uses n processing
elements to sort the list in time (log n)^2.
• Since the serial runtime of a (comparison-based) sort is
n log n, speedup and efficiency of this algorithm are
given by n / log n and 1 / log n, respectively.
• The p TP product of this algorithm is n (log n)^2.
• This is not cost optimal, but only by a factor of log n.
• If p < n, assigning n tasks to p processors gives
TP = n (log n)^2 / p.
• The corresponding speedup is p / log n, which goes down
as the problem size n is increased for a given p!
Redundancy
• Hardware redundancy: more processors are employed
for a single application, at least one acting as standby
– Very costly, but often very effective, solution
• Redundancy can be planned at a finer grain
– Individual servers can be replicated
– Redundant hardware can be used for non-critical activities
when no faults are present
• Software redundancy: software must be designed so
that the state of permanent data can be recovered or
“rolled back” when a fault is detected
Granularity of Parallelism
• Given by the average size of a sequential component
in a parallel computation
• Independent parallelism: independent processes. No
need to synchronize.
• Coarse-grained parallelism: relatively independent
processes with occasional synchronization.
• Medium-grained parallelism. E.g. multi-threads
which synchronize frequently.
• Fine-grained parallelism: synchronization every few
instructions.
Granularity: Effect on Performance
• Often, using fewer processors improves performance
• Using fewer than the maximum possible number of
processing elements to execute a parallel algorithm is
called scaling down a parallel system.
• A naive way to scale down is to think of each processor
in the original case as a virtual processor and to assign
virtual processors equally to scaled down processors.
• Since the number of processing elements decreases by
a factor of n / p, the computation at each processing
element increases by a factor of n / p.
• The communication cost should not increase by this
factor, since some of the virtual processors assigned to a
physical processor might talk to each other.
• This is the improvement from building granularity.
Scaling Characteristics of Parallel
Programs
• The efficiency of a parallel program can be written as:

E = S / p = TS / (p TP)

or

E = 1 / (1 + To / TS)
• The total overhead To is an increasing function of p


• For a given problem size (TS constant), as we increase
the number of processing elements, To increases
• The overall efficiency of the parallel program goes
down. This is the case for all parallel programs
Scalability of Parallel Programs
• The total overhead function To is a function of both
problem size Ts and the no. of processing elements p
• In many cases, To grows sublinearly with respect to Ts
• In such cases, the efficiency increases if the problem
size is increased keeping the number of processing
elements constant.
• For such systems, we can simultaneously increase the
problem size and number of processors to keep
efficiency constant.
• We call such systems scalable parallel systems.
Efficiency vs. No of
Processors and Scalability
• The efficiency of a parallel system may even increase if problem
size is increased while keeping constant the number of processors

Variation of efficiency: (a) as the number of processing elements is increased for


a given problem size; and (b) as the problem size is increased for a given number
of processing elements. Case (b) is not common to all parallel systems.
Isoefficiency Metric of Scalability
• Before we formalize this rate, we define the problem
size W: the asymptotic number of operations performed
by the best serial algorithm to solve the problem (it can
therefore be treated as a time metric)
• The rate at which the problem size must increase
with respect to the number of processing elements to
keep the efficiency fixed determines the system
scalability
• The slower this rate, the better.
Isoefficiency Metric of Scalability
• We can write parallel runtime as:

TP = (W + To(W, p)) / p

• The resulting expression for speedup is

S = W / TP = W p / (W + To(W, p))

• Finally, we write the expression for efficiency as

E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W)

Isoefficiency Metric of Scalability
• For scalable parallel systems, efficiency can be maintained at a fixed
value (between 0 and 1) if the ratio To / W is maintained constant
• For a desired value E of efficiency,

E = 1 / (1 + To(W, p) / W), i.e., To(W, p) / W = (1 - E) / E

• If K = E / (1 - E) is a constant depending on the efficiency to be
maintained, then, since To is a function of W and p, we have:

W = K To(W, p)
Isoefficiency Metric of Scalability
• The problem size W can usually be obtained as a
function of p by algebraic manipulations to keep
efficiency constant.
• This function is called the isoefficiency function and
determines the ease with which a parallel system can
maintain a constant efficiency
• In other words, it can achieve speedups increasing in
proportion to the number of processing elements
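As an illustration, take the standard example of adding n numbers on p processing elements and assume the parallel time is roughly TP = n/p + 2 log p (binary-tree summation with unit-cost additions and communications). Then To = p TP − TS ≈ 2 p log p, so keeping E fixed requires W = K · 2 p log p: the isoefficiency function is Θ(p log p), i.e., the problem size must grow at least as fast as p log p to keep efficiency constant as p grows.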
Cost-Optimality and Isoefficiency
• A parallel system is cost-optimal if and only if

p TP = Θ(W)

• From this, we have:

W + To(W, p) = Θ(W), i.e., To(W, p) = O(W), i.e., W = Ω(To(W, p))

• If we have an isoefficiency function f(p), then the
relation W = Ω(f(p)) must be satisfied to ensure the
cost-optimality of a parallel system as it is scaled up.
Lower Bounds on Isoefficiency
• For a problem with W units of work, no more than W
processing elements can be used cost-optimally
• The problem size must increase at least as fast as Θ(p)
to maintain fixed efficiency; hence, Ω(p) is the
asymptotic lower bound on the isoefficiency function
• The maximum number of tasks that can be executed
simultaneously at any time in a parallel algorithm is
called its degree of concurrency, or parallelism, C(W)
• For a problem of size W, no more than C(W)
processing elements can be employed effectively.
Variation of the Degree of
Parallelism (Concurrency)
• Is given by the number of operations which can be
scheduled for simultaneous (parallel) execution
• For pipeline parallelism, where data is vector –
shaped, the degree is coincident with vector size
(length)
• It may be constant throughout the steps of an
algorithm, but most often it varies
• It is best illustrated by representing parallel
computations as DAGs
DAGs (Directed Acyclic Graphs)
- very simple, yet powerful tools -
 In parallel programming there is a large gap between
Problem Structure <---> Solution Structure
 We may try an intermediate step:
Problem ---> Directed Acyclic Graph (DAG) ---> Solution
(Particular DAGs are the so-called Task Graphs)

Problem ---> DAG:


split problem into tasks
DAG ---> Solution:
map tasks to parallel architecture
What Is A Task Graph?
A graph which has:
1 root, 1 leaf, no cycles and all nodes connected

[Figure: four example graphs A, B, C and D
that are graphs but are not task graphs]
Total or Partial Ordering?

 A DAG (or task graph) may represent a collection of


tasks that can be ordered into a sequence
 Tasks are subject to the constraint that certain tasks
must be performed earlier than others
 The representation uses a DAG with a vertex for
each task and an edge for each constraint
 Algorithms for topological ordering may be used to
generate a valid sequence
 Is this sequence unique? *
Task Dependencies
 directed acyclic graph capturing causal dependency between tasks

 a task corresponding to a node can be executed only after all the
tasks at the sources of its incoming edges have been executed

[Figure: example task dependency graphs - a chain (sequential
summation, traversal, …) and a tree (binary summation, merge sort, …)]
How to go from Problem ---> Task Graph?

 The standard algorithm to create task graphs :


1. Divide problem into set of n tasks
2. Every task becomes a node in the task graph
3. If task(xx) cannot start before task(yy) has finished,
then draw a directed edge from node(yy) to node(xx)
4. Identify (or create) starting and finishing tasks
 The process (execution) flows through the task
graph like pipelining in a single processor system
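A minimal sequential sketch in C of executing such a task graph (the four-task fork/join graph used here is hypothetical; on a real machine the tasks that become ready in the same sweep could run in parallel):

    #include <stdio.h>

    #define MAX_TASKS 16

    /* dep[y][x] = 1 means task x cannot start before task y has finished. */
    static int dep[MAX_TASKS][MAX_TASKS];
    static int n_tasks = 4;

    /* A task is ready when every predecessor has completed. */
    static int ready(int t, const int done[]) {
        for (int y = 0; y < n_tasks; y++)
            if (dep[y][t] && !done[y])
                return 0;
        return 1;
    }

    int main(void) {
        /* Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3. */
        dep[0][1] = dep[0][2] = dep[1][3] = dep[2][3] = 1;

        int done[MAX_TASKS] = {0};
        int completed = 0;

        while (completed < n_tasks) {
            for (int t = 0; t < n_tasks; t++)
                if (!done[t] && ready(t, done)) {
                    printf("executing task %d\n", t);
                    done[t] = 1;
                    completed++;
                }
        }
        return 0;
    }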
A Complete Parallel Algorithmic
Design Model (Foster)
Involves all of the following:
1. identifying the portions of the work that can be
performed concurrently
2. mapping the concurrent pieces of work onto multiple
processes running in parallel
3. distributing the input, output and intermediate data
associated with the program
4. managing access to data shared by multiple processes
5. synchronizing the processes in various stages of parallel
program execution
Optimal choices depend on the parallel architecture.
The goal is to attain good performance in all stages.
Foster’s 5-Step Guide to Parallelization
Identify computational hotspots
• find what is worth parallelizing
Partition the problem into smaller semi-independent tasks
• find/create parallelism
Identify Communication requirements between these tasks
• realize the constraints communication puts on parallelism
Agglomerate smaller tasks into larger tasks
• group the basic tasks together so that the communication is
minimized, while still allowing good load balancing properties
Translate (map) tasks/data to actual processors
• balance the load of processors, while trying to minimize
communication
Asymptotic Analysis of Parallel
Programs
For the problem of sorting a list of n numbers, the fastest serial
programs for this problem run in time Θ(n log n). Consider four
parallel algorithms, A1, A2, A3, and A4 compared as follows:

The table shows the number of processing elements, parallel runtime, speedup,
efficiency and cost (the pTP product)
Which Parallel Program is Best?
• If the metric is speed, algorithm A1 is the best,
followed by A3, A4, and A2 (in order of
increasing TP).
• In terms of efficiency, A2 and A4 are the best,
followed by A3 and A1.
• In terms of cost, algorithms A2 and A4 are cost
optimal, A1 and A3 are not.
• It is important to identify the objectives of analysis
and to use appropriate metrics!
Other Scalability Metrics
• A number of other metrics have been proposed,
dictated by specific needs of applications.
• For real-time applications, the objective is to scale
up a system to accomplish a task in a specified
time bound.
• In memory constrained environments, metrics
operate at the limit of memory and estimate
performance under this problem growth rate.
Scaled Speedup
• Speedup obtained when the problem size is increased
linearly with the number of processing elements.
• If scaled speedup is close to linear, the system is
considered scalable.
• If the isoefficiency is near linear, scaled speedup
curve is close to linear as well.
• If the aggregate memory grows linearly in p, scaled
speedup increases problem size to fill memory.
• Alternately, the size of the problem is increased
subject to an upper-bound on parallel execution time.
Serial Fraction f
• If the serial runtime of a computation can be divided
into a totally parallel and a totally serial component,
we have:

TS = Tser + Tpar   and   TP = Tser + Tpar / p

• Defining the serial fraction as f = Tser / TS, so that
TP = f TS + (1 - f) TS / p, we have:

f = (1/S - 1/p) / (1 - 1/p)
Serial Fraction: Example
Consider the problem of estimating the serial component
of the matrix-vector product.
We have:

f = (1/S - 1/p) / (1 - 1/p)

or, after substituting S = TS / TP and rearranging,

f = (p TP - TS) / ((p - 1) TS) = To / ((p - 1) TS)

Here, the denominator is (a multiple of) the serial runtime and the
numerator is the overhead
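A small C helper for estimating f from a measured speedup (this experimentally determined serial fraction is also known as the Karp-Flatt metric; the measurement values below are hypothetical):

    #include <stdio.h>

    /* f = (1/S - 1/p) / (1 - 1/p) for measured speedup S on p processors. */
    static double serial_fraction(double speedup, int p) {
        return (1.0 / speedup - 1.0 / (double)p) / (1.0 - 1.0 / (double)p);
    }

    int main(void) {
        /* Hypothetical measurement: speedup of 6.0 on 8 processors. */
        printf("f = %.3f\n", serial_fraction(6.0, 8));   /* about 0.048 */
        return 0;
    }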
Amdahl’s Law
• Used to compute an upper bound of speedup
• The speedup of a parallel algorithm is limited
by the percentage of operations which must be
performed sequentially
• Amdahl's Law assumes a fixed data set size,
and same % of overall serial execution time
• If the time taken to do the serial calculations is
some fraction σ of the total time (0 < σ ≤ 1)
– The parallelizable portion is 1 - σ of the total
Amdahl’s Law
• Assuming linear speedup
– Tserial = σT1
– Tparallel = (1- σ)T1/N
• By substitution:
Speedup ≤ 1 / (σ + (1 - σ) / N)
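A small C sketch of the bound (σ and the processor counts are example values):

    #include <stdio.h>

    /* Amdahl's law: S <= 1 / (sigma + (1 - sigma) / n). */
    static double amdahl(double sigma, int n) {
        return 1.0 / (sigma + (1.0 - sigma) / (double)n);
    }

    int main(void) {
        for (int n = 1; n <= 1024; n *= 4)
            printf("sigma = 0.2, N = %4d  ->  speedup <= %.2f\n",
                   n, amdahl(0.2, n));
        return 0;
    }

Even as N grows without bound, the printed speedup approaches 1/σ = 5, which is exactly the limitation discussed on the next slide.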
Consequences of Amdahl’s Law
• Say we have a program containing 100
operations, each of which takes 1 time unit.
• Suppose σ = 0.2, using 80 processors
– Speedup = 100 / (20 + 80/80) = 100 / 21 ≈ 4.8 < 5
– A speedup of at most 5 (= 1/σ) is possible no matter how
many processors are available
• So why bother with parallel computing?...
Just wait for a faster processor 
Limitations of Amdahl’s Law
• To avoid the limitations of Amdahl’s law:
– Concentrate on parallel algorithms with small serial
components
• Amdahl's Law has been criticized for ignoring real-
world overheads such as communication,
synchronization, thread management, as well as the
assumption of infinite-core processors
• It is not complete in that it does not take into account
problem size
• As the no. of processors increases, the amount of data
handled is likely to increase as well
Gustafson's Law
• If a parallel application using 32 processors is able to
compute a data set 32 times the size of the original,
does the execution time of the serial portion increase?
– It does not grow in the same proportion as the data set
– Real-world data suggests that the serial execution time will
remain almost constant
• Gustafson's Law, aka scaled speedup, considers an
increase in the data size in proportion to the increase in
the number of processors, computing the (upper bound)
speedup of the application, as if the larger data set could
be executed in serial
Gustafson's Law Formula
Speedup ≤ p + (1-p)∙s
where:
– p is the number of processors
– s is the percentage of serial execution time in the
parallel application for a given data set size
• Since the % of serial time within the parallel execution
must be known, a typical usage for this formula is to
compute the speedup of the scaled parallel execution
(larger data sets as the number of processors increases)
to the serial execution of the same sized problem
Comparative Results
• For example, if 1% of the execution time on 32 cores is spent
in serial execution, the speedup of this application
over the same data set being run on a single core with
a single thread (assuming that to be possible) is:
Speedup ≤ 32 + (1-32)∙0.01 = 32 - 0.31 = 31.69
• Assuming the serial execution percentage to be 1%,
the equation for Amdahl's Law yields:
Speedup ≤ 1/(0.01 + (0.99/32)) = 24.43
• This comparison is misleading, however, since the given
% of serial time is relative to the 32-core execution
Memory Performance
• Capacity: How many bytes can be held
• Bandwidth: How many bytes can be
transferred per second
• Latency: How much time is needed to fetch
a word
• Routing delays: when data must be gathered
from different parts of memory
• Contention: resolved by memory blocking
